H Roark
2011-Feb-02  15:49 UTC
[R] Efficient way to determine if a data frame has missing observations
I have a data set covering a large number of cities with values for
characteristics such as land area, population, and employment. The problem I
have is that some cities lack observations for some of the characteristics and
I'd like a quick way to determine which cities have missing data.  For
example:
city<-c("A","A","A","B","B","C")
var<-c("sqmi","pop","emp","pop","emp","pop")
value<-c(10,100,40,30,10,20)
df<-data.frame(city,var,value)
In this data frame, city A has complete data for the three variables, while city
B is missing land area, and city C only has population data. In the full data
frame, my approach to finding the missing observations has been to create a data
frame with all combinations of 'city' and 'var', merge this onto
the original data frame, and then extract the observations with missing data for
'value':
city_unq<-c("A","B","C")
var_unq<-c("sqmi","pop","emp")
comb<-expand.grid(city=city_unq,var=var_unq)
mrg<-merge(comb,df,by=c("city","var"),all=T)
missing<-mrg[is.na(mrg$value),]
This works, but on a large dataset it gets slow and I'm looking for a a more
efficient way to achieve this same result.  Any suggestions would be much
appreciated.
Cheers
 		 	   		  
	[[alternative HTML version deleted]]
Erik Iverson
2011-Feb-02  18:07 UTC
[R] Efficient way to determine if a data frame has missing observations
H Roark wrote:> I have a data set covering a large number of cities with values for characteristics such as land area, population, and employment. The problem I have is that some cities lack observations for some of the characteristics and I'd like a quick way to determine which cities have missing data. For example: > > city<-c("A","A","A","B","B","C") > var<-c("sqmi","pop","emp","pop","emp","pop") > value<-c(10,100,40,30,10,20) > df<-data.frame(city,var,value) > > In this data frame, city A has complete data for the three variables, while city B is missing land area, and city C only has population data. In the full data frame, my approach to finding the missing observations has been to create a data frame with all combinations of 'city' and 'var', merge this onto the original data frame, and then extract the observations with missing data for 'value': > > city_unq<-c("A","B","C") > var_unq<-c("sqmi","pop","emp") > comb<-expand.grid(city=city_unq,var=var_unq) > > mrg<-merge(comb,df,by=c("city","var"),all=T) > missing<-mrg[is.na(mrg$value),]Perhaps the following, or a variation thereof? subset(as.data.frame(table(city = df$city, var = df$var)), Freq == 0)
Henrique Dallazuanna
2011-Feb-02  18:13 UTC
[R] Efficient way to determine if a data frame has missing observations
Try this: subset(as.data.frame(xtabs( ~ city + var, df)), !Freq) On Wed, Feb 2, 2011 at 1:49 PM, H Roark <hrbuilder@hotmail.com> wrote:> > I have a data set covering a large number of cities with values for > characteristics such as land area, population, and employment. The problem I > have is that some cities lack observations for some of the characteristics > and I'd like a quick way to determine which cities have missing data. For > example: > > city<-c("A","A","A","B","B","C") > var<-c("sqmi","pop","emp","pop","emp","pop") > value<-c(10,100,40,30,10,20) > df<-data.frame(city,var,value) > > In this data frame, city A has complete data for the three variables, while > city B is missing land area, and city C only has population data. In the > full data frame, my approach to finding the missing observations has been to > create a data frame with all combinations of 'city' and 'var', merge this > onto the original data frame, and then extract the observations with missing > data for 'value': > > city_unq<-c("A","B","C") > var_unq<-c("sqmi","pop","emp") > comb<-expand.grid(city=city_unq,var=var_unq) > > mrg<-merge(comb,df,by=c("city","var"),all=T) > missing<-mrg[is.na(mrg$value),] > > This works, but on a large dataset it gets slow and I'm looking for a a > more efficient way to achieve this same result. Any suggestions would be > much appreciated. > > Cheers > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Henrique Dallazuanna Curitiba-Paraná-Brasil 25° 25' 40" S 49° 16' 22" O [[alternative HTML version deleted]]