"Unternährer Thomas, uth"
2003-Sep-23 12:23 UTC
AW: [R] Rank and extract data from a series
Hi,

> I would like to rank a time-series of data, extract the top ten data items
> from this series, determine the corresponding row numbers for each value in
> the sample, and take a mean of these *row numbers* (not the data).
>
> I would like to do this in R, rather than pre-process the data on the UNIX
> command line if possible, as I need to calculate other statistics for the
> series.
>
> I understand that I can use 'sort' to order the data, but I am not aware of
> a function in R that would allow me to extract a given number of these data
> and then determine their positions within the original time series.
>
> e.g.
>
> Time series:
>
> 1.0 (row 1)
> 4.5 (row 2)
> 2.3 (row 3)
> 1.0 (row 4)
> 7.3 (row 5)
>
> Sort would give me:
>
> 1.0
> 1.0
> 2.3
> 4.5
> 7.3
>
> I would then like to extract the top two data items:
>
> 4.5
> 7.3
>
> and determine their positions within the original (unsorted) time series:
>
> 4.5 = row 2
> 7.3 = row 5
>
> then take a mean:
>
> 2 and 5 = 3.5
>
> Thanks in advance.
>
> James Brown

X <- c(1, 4.5, 2.3, 1, 7.3)
X1 <- sort(X, decreasing=TRUE)[1:2]
X2 <- match(X1, X)
mean(X2)

Hope this helps

Thomas

___________________________________________
James Brown
Cambridge Coastal Research Unit (CCRU)
Department of Geography
University of Cambridge
Downing Place
Cambridge CB2 3EN, UK
Telephone: +44 (0)1223 339776
Mobile: 07929 817546
Fax: +44 (0)1223 355674
E-mail: jdb33 at cam.ac.uk
E-mail: james_510 at hotmail.com
http://www.geog.cam.ac.uk/ccru/CCRU.html
___________________________________________

On Wed, 10 Sep 2003, Jerome Asselin wrote:
> On September 10, 2003 04:03 pm, Kevin S. Van Horn wrote:
> >
> > Your method looks like a naive reimplementation of integration, and
> > won't work so well for distributions that have the great majority of
> > the probability mass concentrated in a small fraction of the sample
> > space. I was hoping for something that would retain the
> > adaptability of integrate().
>
> Yesterday, I've suggested to use approxfun(). Did you consider my
> suggestion? Below is an example.
>
> N <- 500
> x <- rexp(N)
> y <- rank(x)/(N+1)
> empCDF <- approxfun(x,y)
> xvals <- seq(0,4,.01)
> plot(xvals,empCDF(xvals),type="l",
>      xlab="Quantile",ylab="Cumulative Distribution Function")
> lines(xvals,pexp(xvals),lty=2)
> legend(2,.4,c("Empirical CDF","Exact CDF"),lty=1:2)
>
> It's possible to tune in some parameters in approxfun() to better
> match your personal preferences. Have a look at help(approxfun) for
> details.
>
> HTH,
> Jerome Asselin

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Using Thomas Unternährer's handy example, one could also do:

> X <- c(1, 4.5, 2.3, 1, 7.3)
> mean(order(X, decreasing=TRUE)[1:2])
[1] 3.5
>

I think this will give the same results as Thomas Unternährer's suggested
code in almost all cases, but it is perhaps more concise and direct (provided
that you don't actually need the values of the top items). (Of course, you
have to change the 1:2 to 1:10 for your needs.)

Note that this question gets tricky if there are ties such that there is no
unique set of row numbers that identify N "top" items. For example, consider
the following data:

> X <- c(1,3,2,3,4)

Taking "top two", should the answer be 3.5 (avg of row numbers 2 and 5),
4.5 (avg of row numbers 4 and 5), or 3.666667 (avg of row numbers 2, 4 and 5)?

> mean(order(X, decreasing=TRUE)[1:2])
[1] 3.5
> order(X, decreasing=TRUE)[1:2]
[1] 5 2
> # Andy Liaw's suggestion:
> mean(which(X %in% sort(X, decreasing=TRUE)[1:2]))
[1] 3.666667
> which(X %in% sort(X, decreasing=TRUE)[1:2])
[1] 2 4 5
> # Thomas Unternährer's suggestion:
> mean(match(sort(X, decreasing=TRUE)[1:2], X))
[1] 3.5
> match(sort(X, decreasing=TRUE)[1:2], X)
[1] 5 2
>

Hope this helps,

Tony Plate

Tony Plate tplate at acm.org
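
The tie question raised above can be made explicit in code. What follows is only a sketch, not anything posted in the thread: top_rows_mean() and its ties argument are hypothetical names introduced here for illustration, and it assumes x has no NA values and that n is no larger than length(x). With ties = "first" it takes the order() approach (earlier rows win a tie); with ties = "all" it keeps every row whose value reaches the n-th largest value, which matches the which(... %in% ...) result on the example data.

```r
# Hypothetical helper (not from the thread): mean row number of the n
# largest values, with an explicit policy for ties.
# Assumes x contains no NA and n <= length(x).
top_rows_mean <- function(x, n, ties = c("first", "all")) {
  ties <- match.arg(ties)
  if (ties == "first") {
    # order() keeps tied values in their original order, so earlier rows win
    rows <- order(x, decreasing = TRUE)[seq_len(n)]
  } else {
    # keep every row whose value is at least the n-th largest value
    cutoff <- sort(x, decreasing = TRUE)[n]
    rows <- which(x >= cutoff)
  }
  mean(rows)
}

X <- c(1, 3, 2, 3, 4)
top_rows_mean(X, 2, ties = "first")  # rows 5 and 2    -> 3.5
top_rows_mean(X, 2, ties = "all")    # rows 2, 4 and 5 -> 3.666667

# One case where match() and order() disagree: the two largest values are
# equal, so match() returns the first matching row twice.
Y <- c(1, 3, 4, 4)
match(sort(Y, decreasing = TRUE)[1:2], Y)  # 3 3
order(Y, decreasing = TRUE)[1:2]           # 3 4
```

The last two lines show where the match() version and the order() version part ways: when the tie is among the top values themselves, match() counts one row twice, so the mean of row numbers comes out as 3 rather than 3.5.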