Dear all
I have written a function to perform a very simple but useful task which I
do regularly. It is designed to show how many values are missing from each
variable in a data.frame. In its current form it works but is slow because I
have used several loops to achieve this simple task.
Can anyone see a more efficient way to get the same results? Or is there an
existing function which does this?
Thanks for your help
Tim
Function:
miss <- function (data)
{
    miss.list <- list(NA)
    for (i in 1:length(data)) {
        miss.list[[i]] <- table(is.na(data[i]))
    }
    for (i in 1:length(miss.list)) {
        if (length(miss.list[[i]]) == 2) {
            miss.list[[i]] <- miss.list[[i]][2]
        }
    }
    for (i in 1:length(miss.list)) {
        if (names(miss.list[[i]]) == "FALSE") {
            miss.list[[i]] <- 0
        }
    }
    data.frame(names(data), as.numeric(miss.list))
}
Example:
data(ToothGrowth)
data.m <- ToothGrowth
data.m$supp[sample(1:nrow(data.m), size=25)] <- NA
miss(data.m)
Hi

Try

colSums(is.na(data.m))

The result is not a data frame, but you can easily convert it if you want.

Regards
Petr
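The conversion mentioned above could look something like the following sketch. The seed and the column names `variable` and `n.missing` are illustrative assumptions, not part of the original reply.

```r
# Rebuild the example data from the original post
# (the seed is an added assumption so the run is repeatable)
set.seed(1)
data.m <- ToothGrowth
data.m$supp[sample(1:nrow(data.m), size = 25)] <- NA

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs per column
na.counts <- colSums(is.na(data.m))

# Turn the named vector into a two-column data frame
# ("variable" and "n.missing" are hypothetical column names)
miss.df <- data.frame(variable  = names(na.counts),
                      n.missing = as.vector(na.counts),
                      row.names = NULL)
miss.df
```

This gives the same two-column layout as the original miss() function without any explicit loops.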
On Tue, Apr 19, 2011 at 03:29:08PM +0800, Tim Elwell-Sutton wrote:
> Dear all
>
> I have written a function to perform a very simple but useful task which I
> do regularly. It is designed to show how many values are missing from each
> variable in a data.frame. In its current form it works but is slow because I
> have used several loops to achieve this simple task.

Why not use summary?

> foo <- data.frame(a=c(1,3,4,NA), b=c(NA,4,NA,8), c=factor(c('A', NA, 'A', 'B')))
> summary(foo)
       a               b          c
 Min.   :1.000   Min.   :4   A   :2
 1st Qu.:2.000   1st Qu.:5   B   :1
 Median :3.000   Median :6   NA's:1
 Mean   :2.667   Mean   :6
 3rd Qu.:3.500   3rd Qu.:7
 Max.   :4.000   Max.   :8
 NA's   :1.000   NA's   :2

cu
Philipp

--
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
Maximus-von-Imhof-Forum 3
85354 Freising, Germany
http://webclu.bio.wzw.tum.de/~pagel/
Oh, and in case you ONLY want the number of NAs in each column, this should be pretty efficient:

lapply(foo, function(x){sum(is.na(x))})

cu
Philipp
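As a small illustration of the lapply() suggestion, the sketch below rebuilds the `foo` data frame from the summary() example and also shows vapply(), which is a swapped-in variant (not from the original reply) that returns a named vector instead of a list.

```r
foo <- data.frame(a = c(1, 3, 4, NA),
                  b = c(NA, 4, NA, 8),
                  c = factor(c("A", NA, "A", "B")))

# lapply() returns a list with one NA count per column
na.list <- lapply(foo, function(x) sum(is.na(x)))

# vapply() returns a named integer vector and checks that each
# result is a single integer
na.vec <- vapply(foo, function(x) sum(is.na(x)), integer(1))
na.vec   # a = 1, b = 2, c = 1
```

The vector form is often easier to sort or to place into a data frame than the list form.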
I use the following function, which gives me some quick descriptives about
each variable (i.e. number of missing values, proportion missing, case numbers
of the missing values, etc.). It is fairly quick, and maybe not pretty, but
effective on either single variables or entire data sets.
NAhunter <- function(dataset)
{
  # Summarise one variable: descriptives plus the case numbers of missing values
  find.NA <- function(variable)
  {
    if (is.numeric(variable)) {
      n <- length(variable)
      mean <- mean(variable, na.rm = TRUE)
      median <- median(variable, na.rm = TRUE)
      sd <- sd(variable, na.rm = TRUE)
      NAs <- is.na(variable)
      total.NA <- sum(NAs)
      percent.missing <- total.NA / n   # stored as a proportion (0 to 1)
      descriptives <- data.frame(n, mean, median, sd, total.NA, percent.missing)
      rownames(descriptives) <- c(" ")
      Case.Number <- 1:n
      Missing.Values <- ifelse(NAs > 0, "Missing Value", " ")
      missing.value <- data.frame(Case.Number, Missing.Values)
      missing.values <- missing.value[which(Missing.Values == 'Missing Value'), ]
      list("NUMERIC DATA",
           "DESCRIPTIVES" = t(descriptives),
           "CASE # OF MISSING VALUES" = missing.values[, 1])
    }
    else {
      n <- length(variable)
      NAs <- is.na(variable)
      total.NA <- sum(NAs)
      percent.missing <- total.NA / n   # stored as a proportion (0 to 1)
      descriptives <- data.frame(n, total.NA, percent.missing)
      rownames(descriptives) <- c(" ")
      Case.Number <- 1:n
      Missing.Values <- ifelse(NAs > 0, "Missing Value", " ")
      missing.value <- data.frame(Case.Number, Missing.Values)
      missing.values <- missing.value[which(Missing.Values == 'Missing Value'), ]
      list("CATEGORICAL DATA",
           "DESCRIPTIVES" = t(descriptives),
           "CASE # OF MISSING VALUES" = missing.values[, 1])
    }
  }
  # Coerce to a data frame so a single variable can be passed as well
  dataset <- data.frame(dataset)
  # Note: these change global printing options for the rest of the session
  options(scipen = 100)
  options(digits = 2)
  lapply(dataset, find.NA)
}
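For comparison, a more compact per-variable summary could be sketched as below. This is not the NAhunter function itself: `na.summary` and its column names are hypothetical, and it reports only the count, proportion and case numbers of missing values, in one row per variable.

```r
# Hypothetical helper: one row per variable with the NA count,
# the proportion missing, and the case numbers of the missing values
na.summary <- function(dataset) {
  dataset <- as.data.frame(dataset)
  na.mat <- is.na(dataset)              # logical matrix, TRUE where missing
  data.frame(
    variable     = names(dataset),
    n            = nrow(dataset),
    total.NA     = unname(colSums(na.mat)),
    prop.missing = unname(colMeans(na.mat)),
    # comma-separated row indices of the missing values in each column
    cases        = unname(vapply(dataset,
                                 function(x) paste(which(is.na(x)), collapse = ","),
                                 character(1))),
    row.names    = NULL,
    stringsAsFactors = FALSE
  )
}

foo <- data.frame(a = c(1, 3, 4, NA),
                  b = c(NA, 4, NA, 8),
                  c = factor(c("A", NA, "A", "B")))
res <- na.summary(foo)
res
```

Unlike NAhunter, this sketch does not touch the global options() and does not compute means or standard deviations.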
> From: tesutton@hku.hk
> To: r-help@r-project.org
> Date: Tue, 19 Apr 2011 15:29:08 +0800
> Subject: [R] Simple Missing cases Function
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Dear Petr

Thanks so much. That is a LOT more efficient.

Tim

-----Original Message-----
From: Petr PIKAL [mailto:petr.pikal at precheza.cz]
Sent: Tuesday, April 19, 2011 3:37 PM
To: tesutton
Cc: r-help at r-project.org
Subject: Odp: [R] Simple Missing cases Function