iza.ch1
2013-Jul-29 16:39 UTC
[R] replace Na values with the mean of the column which contains them
Hi everyone,

I have a problem with replacing the NA values with the mean of the column which contains them. If I replace the NAs with the mean of the remaining values in the column, the mean of the whole column stays the same as if I had omitted the NA values. I have the following data:

    de
                 [,1]        [,2]       [,3]
     [1,]          NA -0.26928087 -0.1192078
     [2,]          NA  1.20925752  0.9325334
     [3,]          NA  0.38012008 -1.8927164
     [4,]          NA -0.41778861  1.4330507
     [5,]          NA -0.49677462  0.2892706
     [6,]          NA -0.13248754  1.3976522
     [7,]          NA -0.54179054  0.2295291
     [8,]          NA  0.35788624 -0.5009389
     [9,]  0.27500571 -0.41467591 -0.3426560
    [10,] -3.07568579 -0.59234248 -0.8439027
    [11,] -0.42240954  0.73642396 -0.4971999
    [12,] -0.26901731 -0.06768044 -1.6127122
    [13,]  0.01766284 -0.40321968 -0.6508823
    [14,] -0.80999580 -1.52283305  1.4729576
    [15,]  0.20805934  0.25974308 -1.6093478
    [16,]  0.03036708 -0.04013730  0.1686006

and I wrote the code

    de[which(is.na(de))] <- sapply(seq_len(ncol(de)), function(i) {mean(de[,i], na.rm=TRUE)})

I get as the result

                 [,1]        [,2]       [,3]
     [1,] -0.50575168 -0.26928087 -0.1192078
     [2,] -0.12222376  1.20925752  0.9325334
     [3,] -0.13412312  0.38012008 -1.8927164
     [4,] -0.50575168 -0.41778861  1.4330507
     [5,] -0.12222376 -0.49677462  0.2892706
     [6,] -0.13412312 -0.13248754  1.3976522
     [7,] -0.50575168 -0.54179054  0.2295291
     [8,] -0.12222376  0.35788624 -0.5009389
     [9,]  0.27500571 -0.41467591 -0.3426560
    [10,] -3.07568579 -0.59234248 -0.8439027
    [11,] -0.42240954  0.73642396 -0.4971999
    [12,] -0.26901731 -0.06768044 -1.6127122
    [13,]  0.01766284 -0.40321968 -0.6508823
    [14,] -0.80999580 -1.52283305  1.4729576
    [15,]  0.20805934  0.25974308 -1.6093478
    [16,]  0.03036708 -0.04013730  0.1686006

It has replaced the first NA in the first column with the mean of the first column (-0.505...), the second NA with the mean of the second column, and so on.

I want the result to look like this:

                 [,1]        [,2]       [,3]
     [1,] -0.50575168 -0.26928087 -0.1192078
     [2,] -0.50575168  1.20925752  0.9325334
     [3,] -0.50575168  0.38012008 -1.8927164
     [4,] -0.50575168 -0.41778861  1.4330507
     [5,] -0.50575168 -0.49677462  0.2892706
     [6,] -0.50575168 -0.13248754  1.3976522
     [7,] -0.50575168 -0.54179054  0.2295291
     [8,] -0.50575168  0.35788624 -0.5009389
     [9,]  0.27500571 -0.41467591 -0.3426560
    [10,] -3.07568579 -0.59234248 -0.8439027
    [11,] -0.42240954  0.73642396 -0.4971999
    [12,] -0.26901731 -0.06768044 -1.6127122
    [13,]  0.01766284 -0.40321968 -0.6508823
    [14,] -0.80999580 -1.52283305  1.4729576
    [15,]  0.20805934  0.25974308 -1.6093478
    [16,]  0.03036708 -0.04013730  0.1686006

Thanks in advance
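A note on why the assignment in the post above misplaces the means: `which(is.na(de))` returns the linear (column-major) positions of the NAs, while the right-hand side is a vector with just one mean per column. R recycles that short vector across the NA positions in order, without regard to which column each NA belongs to. A minimal sketch with a made-up matrix `m` (hypothetical data, not the poster's):

```r
# Toy matrix (hypothetical data): all four NAs are in column 1
m <- matrix(c(NA, NA, NA, NA, 1,
              10, 20, 30, 40, 50), ncol = 2)

# One mean per column, as in the original code
col_means <- sapply(seq_len(ncol(m)), function(i) mean(m[, i], na.rm = TRUE))
col_means   # column 1 mean is 1, column 2 mean is 30

bad <- m
# Four NA positions but only two replacement values: the values are
# recycled in order (1, 30, 1, 30), so column-2 means land in column 1.
bad[which(is.na(bad))] <- col_means
bad[, 1]
```

The same pattern, with three means recycled over eight NA positions, would account for the alternating values in the first column of the poster's result.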
Berend Hasselman
2013-Jul-29 17:27 UTC
[R] replace Na values with the mean of the column which contains them
On 29-07-2013, at 18:39, "iza.ch1" <iza.ch1 at op.pl> wrote:

> I have a problem with replacing the NA values with the mean of the column which contains them.

This seems to do what you want:

    library(plyr)
    de.res <- t(aaply(de, 2, .fun=function(x) {x[which(is.na(x))] <- mean(x, na.rm=TRUE); x}))
    dimnames(de.res) <- NULL

Berend
John Fox
2013-Jul-29 17:29 UTC
[R] replace Na values with the mean of the column which contains them
Dear iza.ch1,

I hesitate to say this, because mean imputation is such a bad idea, but it's easy to do what you want with a loop, rather than puzzling over a "cleverer" way to accomplish the task. Here's an example using the Freedman data set in the car package:

    > colSums(is.na(Freedman))
    population   nonwhite    density      crime
            10          0         10          0
    > means <- colMeans(Freedman, na.rm=TRUE)
    > for (j in 1:ncol(Freedman)){
    +     Freedman[is.na(Freedman[, j]), j] <- means[j]
    + }
    > colSums(is.na(Freedman))
    population   nonwhite    density      crime
             0          0          0          0
    > colMeans(Freedman)
    population   nonwhite    density      crime
    1135.99000   10.80273  765.67000 2714.08182
    > means
    population   nonwhite    density      crime
    1135.99000   10.80273  765.67000 2714.08182

Now you should probably think about whether you really want to do this...

Best,
John

On Mon, 29 Jul 2013 18:39:48 +0200 "iza.ch1" <iza.ch1 at op.pl> wrote:

> I have a problem with replacing the NA values with the mean of the column which contains them.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
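The loop in the message above carries over unchanged to a plain numeric matrix. The sketch below uses a made-up matrix `m` (hypothetical data) and also checks two consequences of mean imputation: the column means are preserved, as the original poster observed, but the standard deviation of an imputed column shrinks, which is one reason for the caution expressed above.

```r
# Toy matrix (hypothetical data) with NAs in column 1 only
m <- matrix(c(NA, NA, 2, 4, 6,
              1, 2, 3, 4, 5), ncol = 2)

means_before <- colMeans(m, na.rm = TRUE)
sds_before   <- apply(m, 2, sd, na.rm = TRUE)

# Same loop as in the message above, applied to a matrix
filled <- m
for (j in seq_len(ncol(filled))) {
  filled[is.na(filled[, j]), j] <- means_before[j]
}

colMeans(filled)       # equal to means_before: column means are unchanged
apply(filled, 2, sd)   # column 1's sd is now smaller than sds_before[1]
```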
Jorge I Velez
2013-Jul-29 17:32 UTC
[R] replace Na values with the mean of the column which contains them
Consider the following:

    f <- function(x){
        m <- mean(x, na.rm = TRUE)
        x[is.na(x)] <- m
        x
    }

    apply(de, 2, f)

HTH,
Jorge.-

On Tue, Jul 30, 2013 at 2:39 AM, iza.ch1 <iza.ch1@op.pl> wrote:

> I have a problem with replacing the NA values with the mean of the column which contains them.
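`apply()` coerces its input to a matrix and returns a matrix, which is what is wanted for `de`. If the data were held in a data frame instead, the same helper could be applied column by column with `lapply()`. A minimal sketch, assuming a made-up data frame `df` whose columns are all numeric (the names `df`, `a` and `b` are hypothetical):

```r
# The helper from the message above
f <- function(x) {
  m <- mean(x, na.rm = TRUE)
  x[is.na(x)] <- m
  x
}

# Hypothetical data frame; every column is assumed to be numeric
df <- data.frame(a = c(NA, 1, 3), b = c(2, NA, 6))

# Assigning into df[] keeps the data.frame structure while each
# column is replaced by its imputed version
df[] <- lapply(df, f)
df
```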
Berend Hasselman
2013-Jul-29 17:33 UTC
[R] replace Na values with the mean of the column which contains them
On 29-07-2013, at 18:39, "iza.ch1" <iza.ch1 at op.pl> wrote:

> I have a problem with replacing the NA values with the mean of the column which contains them.

or this:

    apply(de, 2, function(x) {x[which(is.na(x))] <- mean(x, na.rm=TRUE); x})

Berend
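The original one-line assignment can also be repaired without `apply()`, by looking up the column of each NA and picking the matching mean. This is only a sketch on a made-up matrix `m` (hypothetical data); it relies on `which(..., arr.ind = TRUE)` returning one (row, col) pair per NA.

```r
# Toy matrix (hypothetical data) with one NA in each column
m <- matrix(c(NA, 1, 3,
              2, NA, 6), ncol = 2)

# One row per NA, with columns named "row" and "col"
na_pos <- which(is.na(m), arr.ind = TRUE)

# Index the column means by the column of each NA, so every NA
# receives the mean of its own column
m[na_pos] <- colMeans(m, na.rm = TRUE)[na_pos[, "col"]]
m
```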
arun
2013-Jul-29 17:57 UTC
[R] replace Na values with the mean of the column which contains them
Hi,

    de <- structure(c(NA, NA, NA, NA, NA, NA, NA, NA,
                      0.27500571, -3.07568579, -0.42240954, -0.26901731,
                      0.01766284, -0.8099958, 0.20805934, 0.03036708,
                      -0.26928087, 1.20925752, 0.38012008, -0.41778861,
                      -0.49677462, -0.13248754, -0.54179054, 0.35788624,
                      -0.41467591, -0.59234248, 0.73642396, -0.06768044,
                      -0.40321968, -1.52283305, 0.25974308, -0.0401373,
                      -0.1192078, 0.9325334, -1.8927164, 1.4330507,
                      0.2892706, 1.3976522, 0.2295291, -0.5009389,
                      -0.342656, -0.8439027, -0.4971999, -1.6127122,
                      -0.6508823, 1.4729576, -1.6093478, 0.1686006),
                    .Dim = c(16L, 3L))

Your code should be:

    sapply(seq_len(ncol(de)), function(i) {de[,i][is.na(de[,i])] <- mean(de[,i], na.rm=TRUE); de[,i]})

A.K.
David Winsemius
2013-Jul-29 18:59 UTC
[R] replace Na values with the mean of the column which contains them
On Jul 29, 2013, at 9:39 AM, iza.ch1 wrote:

> I have a problem with replacing the NA values with the mean of the column which contains them.

Why not replace with values drawn from a normal distribution with the same mean and standard deviation as the existing data?

    set.seed(123)
    de[,1][is.na(de[,1])] <- rnorm(sum(is.na(de[,1])),   # specify the number of random values
                                   mean(de[,1], na.rm=TRUE),
                                   sd(de[,1], na.rm=TRUE))

--
David.

David Winsemius
Alameda, CA, USA
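The suggestion above treats a single column. A sketch of the same idea applied to every column of a matrix follows, using a made-up matrix `m` (hypothetical data). Note that the sample mean and standard deviation of a filled column will generally be close to, but not exactly equal to, those of the observed values, because the replacements are random draws.

```r
set.seed(123)  # only to make the sketch reproducible

# Toy matrix (hypothetical data)
m <- matrix(c(NA, NA, 2, 4, 6,
              1, NA, 3, 4, 5), ncol = 2)

for (j in seq_len(ncol(m))) {
  miss <- is.na(m[, j])
  # n is the number of NAs in the column; mean and sd are computed
  # from the observed (non-missing) values only
  m[miss, j] <- rnorm(sum(miss),
                      mean = mean(m[, j], na.rm = TRUE),
                      sd   = sd(m[, j], na.rm = TRUE))
}
m
```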