Suppose I have a data frame, such as the one below:

tmp <- data.frame(index = gl(2,20), foo = rnorm(40))

And further assume it is sorted by index and then by the variable foo.

tmp <- tmp[order(tmp$index, tmp$foo) , ]

Now, I want to grab the first N rows of tmp for each index. In the end, what I want is the data frame 'result'

tmp1 <- subset(tmp, index == 1)
tmp2 <- subset(tmp, index == 2)

tmp1 <- tmp1[1:5,]
tmp2 <- tmp2[1:5,]
result <- rbind(tmp1, tmp2)

Does anyone see a way to subset and subsequently bind without a loop?

Harold
Harold -
Two ways that come to mind:
1) do.call(rbind,lapply(split(tmp,tmp$index),function(x)x[1:5,]))
2) subset(tmp,unlist(tapply(foo,index,seq_along))<=5)
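
A minimal sketch checking both one-liners against the 'result' built by hand in the question. The seed and the name N are assumptions added for reproducibility. It uses head() in place of x[1:5,] and seq_along() as the within-group counter; both are small substitutions rather than the code exactly as posted.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
tmp <- data.frame(index = gl(2, 20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, tmp$foo), ]
N <- 5

# Reference result built the long way, as in the original question
result <- rbind(subset(tmp, index == 1)[1:N, ],
                subset(tmp, index == 2)[1:N, ])

# 1) split / lapply / rbind; head() never returns more rows than exist
res1 <- do.call(rbind, lapply(split(tmp, tmp$index), head, N))

# 2) within-group counter; seq_along() is safe even for one-row groups
res2 <- subset(tmp, unlist(tapply(foo, index, seq_along)) <= N)

# Same values either way; only the row names of res1 differ
all.equal(res1$foo, result$foo)
all.equal(res2$foo, result$foo)
```

Note that x[1:5,] pads a group with NA rows when it has fewer than 5 rows, whereas head(x, 5) simply returns what is there.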
- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
On Mon, 20 Sep 2010, Doran, Harold wrote:
> Does anyone see a way to subset and subsequently bind without a loop?
Hi Harold,
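I thought of one way to do this, but there is probably a faster way:

tmp <- data.frame(index = gl(3,20), foo = rnorm(60))

# Returns a logical vector marking the first num.of.elements rows of each
# group. Assumes the rows of each group are contiguous in INDEX.
subset.first.x.elements <- function(INDEX, num.of.elements = 5)
{
  # group sizes, in order of first appearance
  t.INDEX <- table(factor(INDEX, levels = unique(INDEX)))
  # running counter 1, 2, ... within each group
  running.indexes <- unlist(lapply(t.INDEX, seq_len))
  ss <- running.indexes %in% 1:num.of.elements
  return(ss)
}
ss <- subset.first.x.elements(tmp[,1])
tmp[ss,]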
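
The same running counter can probably be built more directly with rle() and sequence(). The sketch below is only an illustration: first_n_mask is a hypothetical helper name, the seed is arbitrary, and, like the function above, it assumes the rows of each group sit next to each other.

```r
set.seed(2)  # arbitrary seed, only so the example is reproducible
tmp <- data.frame(index = gl(3, 20), foo = rnorm(60))
tmp <- tmp[order(tmp$index, tmp$foo), ]

# Hypothetical helper: TRUE for the first n rows of each run of equal values
first_n_mask <- function(index, n = 5) {
  runs <- rle(as.character(index))   # lengths of consecutive runs
  sequence(runs$lengths) <= n        # 1, 2, ... restarting at each run
}

top <- tmp[first_n_mask(tmp$index, 5), ]
table(top$index)   # five rows per level
```

If the data are not already grouped, sorting by the index column first (as in the original question) makes the contiguity assumption hold.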
----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------
On 09/20/2010 07:16 PM, Doran, Harold wrote:
> tmp1 <- tmp1[1:5,]
> tmp2 <- tmp2[1:5,]
> result <- rbind(tmp1, tmp2)
>
> Does anyone see a way to subset and subsequently bind without a loop?
>

> do.call(rbind,lapply(split(tmp,tmp$index),head,5))
     index        foo
1.11     1 -1.5124909
1.10     1 -1.3835811
1.20     1 -1.0906574
1.6      1 -0.8588022
1.8      1 -0.8384081
2.28     2 -2.9193984
2.36     2 -0.8782202
2.33     2 -0.7624129
2.38     2 -0.5995872
2.23     2 -0.5912392

(Sorry about the silly rownames.)

Or, (HACK ALERT!)

> tmp[ave(tmp$foo,tmp$index,FUN=seq_along)<=5,]
   index        foo
11     1 -1.5124909
10     1 -1.3835811
20     1 -1.0906574
6      1 -0.8588022
8      1 -0.8384081
28     2 -2.9193984
36     2 -0.8782202
33     2 -0.7624129
38     2 -0.5995872
23     2 -0.5912392

(The silly bit in this case being that you can only ave() a numeric variable.)

Or maybe:

> tmp[unlist(tapply(seq_along(tmp$index), tmp$index, head,5)),]

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
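
On the row names: a small sketch, with an arbitrary seed and made-up object names (by_split, by_pos), of one way to reset them after the split/rbind route, next to the position-based variant, which keeps the row names of tmp.

```r
set.seed(3)  # arbitrary seed, only so the example is reproducible
tmp <- data.frame(index = gl(2, 20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, tmp$foo), ]

# split/rbind prefixes each row name with the group label ("1.11", ...)
by_split <- do.call(rbind, lapply(split(tmp, tmp$index), head, 5))
rownames(by_split) <- NULL          # reset to 1, 2, ..., 10

# Indexing tmp by row position keeps the original row names instead
pos <- unlist(tapply(seq_along(tmp$index), tmp$index, head, 5),
              use.names = FALSE)
by_pos <- tmp[pos, ]

all.equal(by_split$foo, by_pos$foo)
```

Which variant is preferable likely depends on whether the original row names carry any meaning downstream.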
Richard Tan asked a very similar question last week
('get top n rows group by a column from a dataframe').
You could use ave() to make a sequence-number-within-group
vector and choose rows with a small enough value there:
tmp[ave(integer(nrow(tmp)), tmp$index, FUN=seq_along)<=N, ]
If there are fewer than N rows for a given index this returns
all of them but does not pad their number up to N.
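
To illustrate the fewer-than-N case, here is a small sketch with made-up data (the data frame 'small' and its values are invented for the example): one group has 2 rows, the other 7, and N is 5.

```r
N <- 5
small <- data.frame(index = factor(c(1, 1, 2, 2, 2, 2, 2, 2, 2)),
                    foo   = c(0.3, 0.9, -1.2, -0.8, -0.1, 0.2, 0.4, 1.1, 1.5))

# Sequence number within each group: 1, 2 for index 1 and 1..7 for index 2
within_group <- ave(integer(nrow(small)), small$index, FUN = seq_along)
picked <- small[within_group <= N, ]

table(picked$index)   # 2 rows for index 1, 5 rows for index 2
```

By contrast, indexing each group with x[1:N, ] would pad the short group with rows of NA.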
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, September 20, 2010 10:16 AM
> To: R-help
> Subject: [R] Sorting and subsetting
> Does anyone see a way to subset and subsequently bind without a loop?
All the solutions in this thread so far use the lapply(split(...)) paradigm either directly or indirectly. That paradigm doesn't scale. That's the likely source of quite a few 'out of memory' errors and performance issues in R.

data.table doesn't do that internally, and its syntax is pretty easy.

> library(data.table)
> tmp <- data.table(index = gl(2,20), foo = rnorm(40))
> tmp[, .SD[head(order(-foo),5)], by=index]
      index index.1       foo
 [1,]     1       1 1.9677303
 [2,]     1       1 1.2731872
 [3,]     1       1 1.1100931
 [4,]     1       1 0.8194719
 [5,]     1       1 0.6674880
 [6,]     2       2 1.2236383
 [7,]     2       2 0.9606766
 [8,]     2       2 0.8654497
 [9,]     2       2 0.5404112
[10,]     2       2 0.3373457
>

As you can see, it currently repeats the group column, which is a shame (fixing that is on the to-do list). Note that order(-foo) picks the 5 largest values of foo in each group; use order(foo) to get the 5 smallest, as in the original question.

Matthew

http://datatable.r-forge.r-project.org/
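
For comparison, a hedged sketch of how the same selection might be written with a more recent data.table, sorting in i and taking head(.SD, 5) per group. It assumes the package is installed (the example is skipped otherwise), the seed is arbitrary, and output details may differ between versions.

```r
# Assumes the data.table package is installed; skipped otherwise.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  set.seed(4)  # arbitrary seed, only so the example is reproducible
  dt <- data.table(index = gl(2, 20), foo = rnorm(40))

  # Sort by group, then foo ascending, and keep the first 5 rows per group
  top5 <- dt[order(index, foo), head(.SD, 5), by = index]
  print(top5)
}
```

In the versions I am aware of, the grouping column appears only once in the result, so the duplication shown above should no longer occur.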