Hi R-helpers,

I've been struggling with a problem for most of the day (!) so am finally resorting to R-help.

I would like to subset the columns of my dataframe based on the frequency with which the columns contain non-zero values. For example, let's say that I want to retain only those columns which contain non-zero values in at least 1% of their rows.

In Excel I would calculate a row at the bottom of my data sheet and use the following function

=countif(range,">0")

to identify the number of non-zero cells in each column. Then, I would divide that by the number of rows to obtain the frequency of non-zero values in each column. Then, I would delete those columns with frequencies < 0.01.

But, I'd like to do this in R. I think the missing link is an analog to Excel's countif function. Any ideas?

Thanks! Mark
On Wed, 28 Jan 2009, Mark Na wrote:

> But, I'd like to do this in R. I think the missing link is an analog to
> Excel's countif function. Any ideas?

Use something like

    DF[sapply(DF, function(x) mean(x != 0) >= 0.01)]

The comparison x != 0 yields a logical vector, and since logical values are converted to 0/1, mean() gives the frequency (and sum() the count).

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
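As a rough illustration of the frequency-based subsetting described in this reply, here is a minimal sketch. The data frame and its column names (dense, sparse, rare) are made up for the example, and it assumes "non-zero" means x != 0 and that the columns contain no NA values.

```r
# Hypothetical example data: 200 rows, three columns with different
# proportions of non-zero values.
DF <- data.frame(
  dense  = rep(c(0, 3), times = 100),   # 50% non-zero
  sparse = c(5, 7, 1, 2, rep(0, 196)),  # 2% non-zero
  rare   = c(2, rep(0, 199))            # 0.5% non-zero
)

# x != 0 is a logical vector; its mean is the fraction of TRUE values,
# i.e. the frequency of non-zero entries in that column.
freq <- sapply(DF, function(x) mean(x != 0))

# Keep only the columns whose non-zero frequency is at least 1%.
kept <- DF[freq >= 0.01]
names(kept)
```

Replacing mean() with sum() in the same call gives the per-column count, which is the closer analogue of Excel's countif.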
One approach to such a problem would be to use a logical vector inside the function colSums.

?colSums

> DF <- data.frame(XX = runif(20), YY = runif(20))
> colSums(DF > 0.5)
XX YY
11  9
> colSums(DF > -Inf)
XX YY
20 20
> colSums(DF > 0.5)/colSums(DF > -Inf)  # could have used DF >= min(DF) in the denominator
  XX   YY
0.55 0.45

-- 
David Winsemius

On Jan 28, 2009, at 11:11 AM, Mark Na wrote:

> In Excel I would calculate a row at the bottom of my data sheet and use the
> following function
>
> =countif(range,">0")
>
> to identify the number of non-zero cells in each column. Then, I would
> divide that by the number of rows to obtain the frequency of non-zero values
> in each column. Then, I would delete those columns with frequencies < 0.01.

I don't think that would do what you describe unless you were only working with single column ranges. Functions on ranges in Excel are not calculated by column.
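The colSums idea can be carried through to the column subsetting that the original question asked for. The following is only a sketch: the data frame, the variable names (nonzero_count, nonzero_freq, threshold, DF_kept), and the choice of DF != 0 as the test are assumptions made for the example. It uses colMeans, the companion of colSums, to get the frequency directly.

```r
# Hypothetical example data with mostly-zero, all-zero and mostly-non-zero columns.
DF <- data.frame(
  XX = c(0, 0, 1.5, 0, 2),
  YY = c(0, 0, 0, 0, 0),
  ZZ = c(4, 1, 0, 3, 2)
)

# DF != 0 is a logical matrix; colSums counts the TRUE values per column
# (the countif analogue) and colMeans divides that count by the number of rows.
nonzero_count <- colSums(DF != 0)
nonzero_freq  <- colMeans(DF != 0)

# Keep the columns whose non-zero frequency reaches the threshold.
# drop = FALSE keeps the result a data frame even if one column remains.
threshold <- 0.01
DF_kept <- DF[, nonzero_freq >= threshold, drop = FALSE]
names(DF_kept)
```

If the data may contain missing values, both colSums and colMeans accept na.rm = TRUE; whether NA should count as zero or be ignored is a decision for the analyst.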