I've completed an experiment and want to summarize the results.

There are two things I like to create.

1) A simple count of things from the data.frame with predictions
    1a) Number of predictions with probability greater than x
    1b) Number of predictions with probability greater than x that are really true

    In SQL, this would be,
        "Select count(predictions) from data.frame where probability > x"
        "Select count(predictions) from data.frame where probability > x and label ='T' "

How can I do this one in R?

2) I'd like to create what we call "binning". It is a simple list of probability ranges and how accurate our model is. The idea is to see how "true" our probabilities are.
for example

range      number of items   mean(probability)   true_accuracy
100-90%    20                .924                .90
90-80%     50                .825                .84
80-70%     214               .75                 .71
etc...

It would be really great if I could also graph this!

Is there any kind of package or way to do this in R

Thanks!

-N
Try this using the built-in data frame iris:

> length(subset(iris, Sepal.Length >= 7, Sepal.Width)[[1]])
[1] 13
> length(subset(iris, Sepal.Length >= 7 & Species == 'virginica', Sepal.Width)[[1]])
[1] 12
> # or the following (note that dot in Sepal.Length is automatically
> # converted to _ because dot has special meaning in sql)
> library(sqldf)
> sqldf("select count(*) from iris where Sepal_Length >= 7")
  count(*)
1       13
> sqldf("select count(*) from iris where Sepal_Length >= 7 and Species = 'virginica'")
  count(*)
1       12

For the second part, use cut to create a factor with the levels you want

iris$Sepal.Length.factor <- cut(iris$Sepal.Length, 4:8)

and then summarize as desired using sql such as:

> sqldf("select Sepal_Length_factor, avg(Sepal_Length), count(Sepal_Length) from iris group by Sepal_Length_factor")
  Sepal_Length_factor avg(Sepal_Length) count(Sepal_Length)
1               (4,5]          4.787500                  32
2               (5,6]          5.550877                  57
3               (6,7]          6.473469                  49
4               (7,8]          7.475000                  12

or use summaryBy in the doBy package.

See ?cut, ?subset, and in doBy see ?summaryBy
Also see http://sqldf.googlecode.com

On Tue, Aug 4, 2009 at 11:40 PM, Noah Silverman <noah at smartmediacorp.com> wrote:
> I've completed an experiment and want to summarize the results.
>
> There are two things I like to create.
>
> 1) A simple count of things from the data.frame with predictions
>     1a) Number of predictions with probability greater than x
>     1b) Number of predictions with probability greater than x that are really true
>
>     In SQL, this would be,
>         "Select count(predictions) from data.frame where probability > x"
>         "Select count(predictions) from data.frame where probability > x and label ='T' "
>
> How can I do this one in R?
>
> 2) I'd like to create what we call "binning". It is a simple list of
> probability ranges and how accurate our model is. The idea is to see how
> "true" our probabilities are.
> for example
>
> range      number of items   mean(probability)   true_accuracy
> 100-90%    20                .924                .90
> 90-80%     50                .825                .84
> 80-70%     214               .75                 .71
> etc...
>
> It would be really great if I could also graph this!
>
> Is there any kind of package or way to do this in R
>
> Thanks!
>
> -N
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
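The cut-then-summarize idea from the reply above can also be applied directly to prediction data using only base R. What follows is a minimal sketch, not the poster's actual setup: the data frame preds, its columns probability and label, the simulated values, and the ten equal-width bins are all assumptions made for illustration.

```r
# Hypothetical data: 'preds', 'probability' and 'label' are assumed names,
# and the values are simulated so the example is self-contained.
set.seed(1)
preds <- data.frame(probability = runif(500))
preds$label <- ifelse(runif(500) < preds$probability, "T", "F")

# Bin the probabilities into ten equal-width ranges.
# include.lowest = TRUE keeps a probability of exactly 0 in the first bin.
preds$bin <- cut(preds$probability, breaks = seq(0, 1, by = 0.1),
                 include.lowest = TRUE)

# One row per bin: item count, mean predicted probability,
# and the observed fraction of labels equal to "T".
binned <- data.frame(
  n             = as.vector(table(preds$bin)),
  mean_prob     = as.vector(tapply(preds$probability, preds$bin, mean)),
  true_accuracy = as.vector(tapply(preds$label == "T", preds$bin, mean)),
  row.names     = levels(preds$bin)
)
print(binned)
```

For the graph, one option would be plot(binned$mean_prob, binned$true_accuracy) followed by abline(0, 1): points close to the diagonal indicate bins where the predicted probabilities match the observed accuracy.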
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Noah Silverman
> Sent: Tuesday, August 04, 2009 8:40 PM
> To: r help
> Subject: [R] Counting things
>
> I've completed an experiment and want to summarize the results.
>
> There are two things I like to create.
>
> 1) A simple count of things from the data.frame with predictions
>     1a) Number of predictions with probability greater than x

sum(logicalVector) returns the number of TRUEs in logicalVector, because it converts TRUE to 1 and FALSE to 0 before doing the sum. You will have to use na.rm=TRUE if there are NA's (missing values) in the logical vector. Hence you can compute 1a with
    sum(probabilities>x)
mean(probabilities>x) will give the proportion of times probabilities>x is TRUE.
table(probabilities>x) will give a count of both the FALSEs and TRUEs.

>     1b) Number of predictions with probability greater than
> x that are really true

    sum(probabilities>x & label=="T")
(I'm guessing that label is a character or factor vector with values "T" and "F".)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
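The counting idioms in the reply above can be tried on a small example. This is a minimal sketch on made-up data: the vectors probabilities and label, their values, and the threshold x = 0.7 are assumptions for illustration, with one NA included to show why na.rm=TRUE matters.

```r
# Hypothetical data standing in for the poster's columns; one probability is missing.
probabilities <- c(0.95, 0.80, NA, 0.65, 0.91, 0.40)
label <- c("T", "F", "T", "T", "T", "F")
x <- 0.7

# 1a) number of predictions with probability greater than x
n_above <- sum(probabilities > x, na.rm = TRUE)

# 1b) number of those predictions whose label is "T"
n_above_true <- sum(probabilities > x & label == "T", na.rm = TRUE)

# Proportion of non-missing predictions above x
prop_above <- mean(probabilities > x, na.rm = TRUE)

# Counts of FALSE and TRUE, plus NA when any value is missing
counts <- table(probabilities > x, useNA = "ifany")

print(c(n_above = n_above, n_above_true = n_above_true, prop_above = prop_above))
print(counts)
```

Without na.rm=TRUE, both sum() and mean() would return NA here, because the comparison on the missing probability is itself NA.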