I have a dataset that looks like this (many other variables not shown,
including a unique row identifier "id"):

> summary(hits)
    query               lib               coverage         percid
 Length:80664       Length:80664       Min.   :0.080   Min.   :0.2250
 Mode  :character   Mode  :character   1st Qu.:0.980   1st Qu.:0.8160
                                       Median :1.000   Median :0.9230
                                       Mean   :0.946   Mean   :0.8536
                                       3rd Qu.:1.000   3rd Qu.:0.9900
                                       Max.   :1.000   Max.   :1.0000

For any query/lib combination there may be 1 or more rows of data. I'd
like to be able to specify only the rows for each query/lib combination
that have the maximum (or minimum or whatever) coverage or percid or some
other data element, and carry along the other corresponding data elements
from that same row.

I know I can do this procedurally in a loop:

    query <- c('')
    lib <- c('')
    coverage <- c(0)
    percid <- c(0)
    for(q in unique(hits$query)) {
      for(l in unique(hits$lib[hits$query == q])) {
        query <- c(query, q)
        lib <- c(lib, l)
        max.coverage <- 0
        for(id in hits$id[hits$query == q & hits$lib == l]) {
          if(hits$coverage[hits$id == id] > max.coverage) {
            max.coverage.id <- id
            max.coverage <- hits$coverage[hits$id == id]
          }
        }
        coverage <- c(coverage, hits$coverage[hits$id == max.coverage.id])
        percid <- c(percid, hits$percid[hits$id == max.coverage.id])
      }
    }
    filtered.hits <- data.frame(query=query[2:length(query)],
                                lib=lib[2:length(lib)],
                                coverage=coverage[2:length(coverage)],
                                percid=percid[2:length(percid)]
                               )
    # finally get to do something with it now:
    plot(filtered.hits$coverage[filtered.hits$query == 'ABC'],
         filtered.hits$percid[filtered.hits$query == 'ABC']
        )

So, how could I accomplish the same plot as above without the looping and
creating a new dataframe?

Thanks,

-Aaron

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Tue, 27 Nov 2001, Aaron J Mackey wrote:

> I have a dataset that looks like this (many other variables not
> shown, including a unique row identifier "id"):
>
> > summary(hits)
>     query               lib               coverage         percid
>  Length:80664       Length:80664       Min.   :0.080   Min.   :0.2250
>  Mode  :character   Mode  :character   1st Qu.:0.980   1st Qu.:0.8160
>                                        Median :1.000   Median :0.9230
>                                        Mean   :0.946   Mean   :0.8536
>                                        3rd Qu.:1.000   3rd Qu.:0.9900
>                                        Max.   :1.000   Max.   :1.0000
>
> For any query/lib combination there may be 1 or more rows of data. I'd
> like to be able to specify only the rows for each query/lib combination
> that have the maximum (or minimum or whatever) coverage or percid or some
> other data element, and carry along the other corresponding data elements
> from that same row.
>
> I know I can do this procedurally in a loop:
>
<snip: he does it>
>
> So, how could I accomplish the same plot as above without the looping and
> creating a new dataframe?

Well, for one query you can do the subsetting like

    hits[hits$coverage == max(hits$coverage),]

so for many queries you could tapply() or by() this process

    filter<-function(this.subset){
       this.subset[this.subset$coverage==max(this.subset$coverage),]
    }

    filtered<-by(hits, hits$query,filter)

This produces a list of dataframes, so you want to staple them back
together

    filtered<-do.call("rbind",filtered)

In the case of maximum or minimum it would be faster to use the
which.max/which.min functions instead of the expression
    hits$coverage == max(hits$coverage)

This may or may not be faster than the loop, but it should be easier to
read (at least if you understand by()).

	-thomas
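
For reference, here is a minimal sketch of the which.max() variant
mentioned above, grouping on the query/lib combination from the original
question rather than on query alone. The toy `hits` data frame and the
names keep.max, groups and abc are made up for illustration and are not
from the thread. It uses split() plus lapply() in place of by(); that is a
substitution of convenience, not necessarily what was intended above. Note
that which.max() keeps only the first row when several rows tie for the
maximum, whereas the == max() form keeps all of them.

```r
# Toy stand-in for `hits`; all values are invented for illustration only.
hits <- data.frame(
  id       = 1:6,
  query    = c("ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ"),
  lib      = c("lib1", "lib1", "lib2", "lib2", "lib1", "lib1"),
  coverage = c(0.90, 1.00, 0.80, 0.70, 0.50, 0.95),
  percid   = c(0.70, 0.85, 0.60, 0.99, 0.40, 0.92),
  stringsAsFactors = FALSE
)

# Keep the single row with the largest coverage in one subset.
# which.max() returns the index of the first maximum, so ties give one row.
keep.max <- function(this.subset) {
  this.subset[which.max(this.subset$coverage), ]
}

# Split on the query/lib combination; drop = TRUE skips combinations
# that do not occur in the data.
groups <- split(hits, list(hits$query, hits$lib), drop = TRUE)

# Apply the filter to each piece and staple the pieces back together.
filtered <- do.call("rbind", lapply(groups, keep.max))

# Rows for one query, with the other columns carried along.
abc <- filtered[filtered$query == "ABC", ]
```

With that in hand, the plot from the original post would be
plot(abc$coverage, abc$percid). Swapping which.max for which.min, or
coverage for percid, inside keep.max selects on a different criterion.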