Suppose I have a data frame, such as the one below:

    tmp <- data.frame(index = gl(2,20), foo = rnorm(40))

And further assume it is sorted by index and then by the variable foo.

    tmp <- tmp[order(tmp$index, tmp$foo) , ]

Now, I want to grab the first N rows of tmp for each index. In the end, what I want is the data frame 'result':

    tmp1 <- subset(tmp, index == 1)
    tmp2 <- subset(tmp, index == 2)

    tmp1 <- tmp1[1:5,]
    tmp2 <- tmp2[1:5,]
    result <- rbind(tmp1, tmp2)

Does anyone see a way to subset and subsequently bind without a loop?

Harold
Harold -

Two ways that come to mind:

    1) do.call(rbind,lapply(split(tmp,tmp$index),function(x)x[1:5,]))

    2) subset(tmp,unlist(tapply(foo,index,seq_along))<=5)

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spector at stat.berkeley.edu

On Mon, 20 Sep 2010, Doran, Harold wrote:

> Does anyone see a way to subset and subsequently bind without a loop?
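One difference worth noting between indexing a group with x[1:5,] and taking head(x, 5) appears when a group has fewer than five rows: the first form pads the result with all-NA rows, the second does not. The following is only a minimal sketch of that behaviour; the data frame `small` and its values are made up for illustration and are not from the thread.

```r
# Hypothetical data: group "a" has 2 rows, group "b" has 6 rows
small <- data.frame(index = factor(rep(c("a", "b"), times = c(2, 6))),
                    foo   = c(0.1, 0.2, 1:6))

# Indexing with 1:5 asks for rows that do not exist in group "a",
# so that group comes back with three all-NA rows appended
padded <- do.call(rbind, lapply(split(small, small$index),
                                function(x) x[1:5, ]))

# head() stops at the end of each group instead of padding
trimmed <- do.call(rbind, lapply(split(small, small$index), head, 5))

nrow(padded)             # rows returned by the 1:5 version
sum(is.na(padded$foo))   # how many of them are NA padding
nrow(trimmed)            # rows returned by the head() version
```

If every group is known to have at least N rows, as in the original example with 20 rows per index, the two forms should return the same rows.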
Hi Harold,

I thought of one way to do this, but maybe (probably) there is a faster way:

    tmp <- data.frame(index = gl(3,20), foo = rnorm(60))

    subset.first.x.elements <- function(INDEX, num.of.elements = 5)
    {
        # group sizes, in order of first appearance
        # (assumes the rows of each group are contiguous)
        t.INDEX <- table(factor(INDEX, levels = unique(INDEX)))
        # running position of each row within its group
        running.indexes <- unlist(lapply(t.INDEX, seq_len))
        ss <- running.indexes %in% seq_len(num.of.elements)
        return(ss)
    }

    ss <- subset.first.x.elements(tmp[,1])
    tmp[ss,]

Contact me: Tal.Galili@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)

On Mon, Sep 20, 2010 at 7:16 PM, Doran, Harold <HDoran@air.org> wrote:

> Does anyone see a way to subset and subsequently bind without a loop?
On 09/20/2010 07:16 PM, Doran, Harold wrote:

> tmp1 <- tmp1[1:5,]
> tmp2 <- tmp2[1:5,]
> result <- rbind(tmp1, tmp2)
>
> Does anyone see a way to subset and subsequently bind without a loop?
>

    > do.call(rbind,lapply(split(tmp,tmp$index),head,5))
         index        foo
    1.11     1 -1.5124909
    1.10     1 -1.3835811
    1.20     1 -1.0906574
    1.6      1 -0.8588022
    1.8      1 -0.8384081
    2.28     2 -2.9193984
    2.36     2 -0.8782202
    2.33     2 -0.7624129
    2.38     2 -0.5995872
    2.23     2 -0.5912392

(Sorry about the silly rownames.)

Or, (HACK ALERT!)

    > tmp[ave(tmp$foo,tmp$index,FUN=seq_along)<=5,]
       index        foo
    11     1 -1.5124909
    10     1 -1.3835811
    20     1 -1.0906574
    6      1 -0.8588022
    8      1 -0.8384081
    28     2 -2.9193984
    36     2 -0.8782202
    33     2 -0.7624129
    38     2 -0.5995872
    23     2 -0.5912392

(The silly bit in this case being that you can only ave() a numeric variable.)

Or maybe:

    > tmp[unlist(tapply(seq_along(tmp$index), tmp$index, head,5)),]

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
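The three variants above can be checked against each other on the data from the original question. This is a sketch only: the seed is arbitrary and is there just to make the run reproducible, and `same_rows` is a hypothetical helper that compares column contents because the row names differ between the variants.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
tmp <- data.frame(index = gl(2, 20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, tmp$foo), ]

# Variant 1: split, take the head of each piece, bind back together
r1 <- do.call(rbind, lapply(split(tmp, tmp$index), head, 5))

# Variant 2: within-group sequence numbers via ave()
r2 <- tmp[ave(tmp$foo, tmp$index, FUN = seq_along) <= 5, ]

# Variant 3: pick row positions per group, then index once
r3 <- tmp[unlist(tapply(seq_along(tmp$index), tmp$index, head, 5)), ]

# Hypothetical helper: compare contents only, ignoring row names
same_rows <- function(x, y) {
  identical(as.character(x$index), as.character(y$index)) &&
    identical(x$foo, y$foo)
}

same_rows(r1, r2)
same_rows(r2, r3)
```

Variants 2 and 3 index tmp once and so keep its original row names, while variant 1 picks up the group prefix from rbind().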
Richard Tan asked a very similar question last week ('get top n rows group by a column from a dataframe'). You could use ave() to make a sequence-number-within-group vector and choose rows with a small enough value there:

    tmp[ave(integer(nrow(tmp)), tmp$index, FUN=seq_along)<=N, ]

If there are fewer than N rows for a given index this returns all of them but does not pad their number up to N.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, September 20, 2010 10:16 AM
> To: R-help
> Subject: [R] Sorting and subsetting
>
> Does anyone see a way to subset and subsequently bind without a loop?
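To illustrate the remark about groups with fewer than N rows, here is a small sketch of the ave() approach on made-up data. The data frame `d`, the value of `N`, and the variable names are assumptions chosen for the example; the groups are deliberately interleaved to show that the sequence numbers follow the row order within each group.

```r
N <- 3

# Hypothetical data: group "a" has 2 rows (fewer than N), group "b" has 5,
# and the rows of the two groups are interleaved
d <- data.frame(index = factor(c("b", "a", "b", "b", "a", "b", "b")),
                foo   = 1:7)

# Sequence number of each row within its group, in original row order
seq_in_group <- ave(integer(nrow(d)), d$index, FUN = seq_along)

# Keep rows whose within-group sequence number is at most N
first_n <- d[seq_in_group <= N, ]

seq_in_group
table(first_n$index)
```

Group "a" contributes both of its rows and nothing more, so no NA rows are introduced.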
All the solutions in this thread so far use the lapply(split(...)) paradigm either directly or indirectly. That paradigm doesn't scale. That's the likely source of quite a few 'out of memory' errors and performance issues in R.

data.table doesn't do that internally, and its syntax is pretty easy.

    > library(data.table)
    > tmp <- data.table(index = gl(2,20), foo = rnorm(40))
    > tmp[, .SD[head(order(-foo),5)], by=index]
          index index.1       foo
     [1,]     1       1 1.9677303
     [2,]     1       1 1.2731872
     [3,]     1       1 1.1100931
     [4,]     1       1 0.8194719
     [5,]     1       1 0.6674880
     [6,]     2       2 1.2236383
     [7,]     2       2 0.9606766
     [8,]     2       2 0.8654497
     [9,]     2       2 0.5404112
    [10,]     2       2 0.3373457
    >

As you can see it currently repeats the group column which is a shame (on the to do list to fix).

Matthew

http://datatable.r-forge.r-project.org/