thr3ads.net - R help - [R] Sorting and subsetting [Sep 2010]

Doran, Harold

2010-Sep-20 17:16 UTC

[R] Sorting and subsetting

Suppose I have a data frame, such as the one below:

tmp <- data.frame(index = gl(2,20), foo = rnorm(40))

And further assume it is sorted by index and then by the variable foo.

tmp <- tmp[order(tmp$index, tmp$foo) , ]

Now, I want to grab the first N rows of tmp for each index. In the end, what I
want is the data frame 'result'

tmp1 <- subset(tmp, index == 1)
tmp2 <- subset(tmp, index == 2)

tmp1 <- tmp1[1:5,]
tmp2 <- tmp2[1:5,]
result <- rbind(tmp1, tmp2)

Does anyone see a way to subset and subsequently bind without a loop?

Harold




Phil Spector

2010-Sep-20 17:27 UTC


[R] Sorting and subsetting

Harold -
    Two ways that come to mind:

1) do.call(rbind,lapply(split(tmp,tmp$index),function(x)x[1:5,]))
2) subset(tmp,unlist(tapply(foo,index,seq))<=5)
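For anyone who wants to try these, here is a minimal self-contained sketch of both suggestions run against the sorted `tmp` from the original post. The seed and the names `by_split` and `by_counter` are illustrative additions, not part of the thread. It swaps `seq_along` in for `seq` in approach 2, because `seq()` on a length-one vector `x` returns `1:x` rather than a counter. Approach 2 also assumes `tmp` is already sorted by `index`, since `tapply()` hands back the counters group by group.

```r
set.seed(1)  # illustrative seed, not in the original post
tmp <- data.frame(index = gl(2, 20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, tmp$foo), ]

# 1) Split by group, keep the first 5 rows of each piece, bind the pieces.
by_split <- do.call(rbind, lapply(split(tmp, tmp$index), function(x) x[1:5, ]))

# 2) Build a within-group counter and keep the rows whose counter is <= 5.
#    seq_along is used instead of seq (see the note above).
by_counter <- subset(tmp, unlist(tapply(foo, index, seq_along)) <= 5)

# Both approaches pick the same rows; only the row names differ.
same_rows <- identical(by_split$foo, by_counter$foo)
```

The two results should differ only in their row names: `rbind()` prefixes each row name with the group label in the first approach.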

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu




Tal Galili

2010-Sep-20 17:31 UTC


[R] Sorting and subsetting

Hi Harold,

I thought of one way to do this, but maybe (probably) there is a faster way:


tmp <- data.frame(index = gl(3,20), foo = rnorm(60))

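subset.first.x.elements <- function(INDEX, num.of.elements = 5)
{
  # Count rows per group, keeping groups in order of first appearance
  t.INDEX <- table(factor(INDEX, levels = unique(INDEX)))
  # Running counter 1..n within each group (assumes INDEX is sorted by group)
  running.indexes <- unlist(lapply(t.INDEX, seq_len))
  # TRUE for the first num.of.elements rows of each group
  ss <- running.indexes %in% seq_len(num.of.elements)

  return(ss)
}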

ss <- subset.first.x.elements(tmp[,1])
tmp[ss,]




----------------Contact
Details:-------------------------------------------------------
Contact me: Tal.Galili@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------





Peter Dalgaard

2010-Sep-20 17:49 UTC


[R] Sorting and subsetting

On 09/20/2010 07:16 PM, Doran, Harold wrote:
> tmp1 <- tmp1[1:5,]
> tmp2 <- tmp2[1:5,]
> result <- rbind(tmp1, tmp2)
>
> Does anyone see a way to subset and subsequently bind without a loop?

> do.call(rbind,lapply(split(tmp,tmp$index),head,5))
     index        foo
1.11     1 -1.5124909
1.10     1 -1.3835811
1.20     1 -1.0906574
1.6      1 -0.8588022
1.8      1 -0.8384081
2.28     2 -2.9193984
2.36     2 -0.8782202
2.33     2 -0.7624129
2.38     2 -0.5995872
2.23     2 -0.5912392


(Sorry about the silly rownames.)

Or, (HACK ALERT!)

> tmp[ave(tmp$foo,tmp$index,FUN=seq_along)<=5,]
   index        foo
11     1 -1.5124909
10     1 -1.3835811
20     1 -1.0906574
6      1 -0.8588022
8      1 -0.8384081
28     2 -2.9193984
36     2 -0.8782202
33     2 -0.7624129
38     2 -0.5995872
23     2 -0.5912392

(The silly bit in this case being that you can only ave() a numeric
variable.)

Or maybe:
> tmp[unlist(tapply(seq_along(tmp$index), tmp$index, head,5)),]
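One practical difference between the `head` variants and the `x[1:5,]` form shows up when a group has fewer than five rows. The sketch below uses a small made-up data frame (`small`, plus the result names `padded`, `trimmed` and `by_position`, are all illustrative) to show that indexing with `1:5` pads short groups with NA rows, while `head()` just returns what is there.

```r
# Two groups of three rows each, so every group is shorter than N = 5.
small <- data.frame(index = gl(2, 3), foo = c(0.3, 0.1, 0.2, 0.9, 0.7, 0.8))
small <- small[order(small$index, small$foo), ]

# x[1:5, ] asks for rows that do not exist and gets NA rows back.
padded <- do.call(rbind, lapply(split(small, small$index), function(x) x[1:5, ]))

# head(x, 5) stops at the end of the group.
trimmed <- do.call(rbind, lapply(split(small, small$index), head, 5))

# Same idea, selecting row positions instead of splitting the data frame.
by_position <- small[unlist(tapply(seq_along(small$index), small$index, head, 5)), ]
```

Here `padded` has ten rows, four of them all NA, while `trimmed` and `by_position` each have the six real rows.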



-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

William Dunlap

2010-Sep-20 18:11 UTC


[R] Sorting and subsetting

Richard Tan asked a very similar question last week
('get top n rows group by a column from a dataframe').
You could use ave() to make a sequence-number-within-group
vector and choose rows with a small enough value there:
   tmp[ave(integer(nrow(tmp)), tmp$index, FUN=seq_along)<=N, ]
If there are fewer than N rows for a given index this returns
all of them but does not pad their number up to N.
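A small sketch of the ave() idea wrapped in a helper, in case it is useful. The function name `top_n_per_group` and the toy data frame `df` are made up for illustration. Because ave() returns its counter in the original row positions, this version does not need the rows of each group to be contiguous; it still takes the first n rows in whatever order the data frame is in, so sort by the ranking variable first.

```r
# Hypothetical helper: keep the first n rows of each group of df.
top_n_per_group <- function(df, group, n) {
  # Sequence number of each row within its group, in original row order
  counter <- ave(integer(nrow(df)), df[[group]], FUN = seq_along)
  df[counter <= n, , drop = FALSE]
}

# Toy data with interleaved groups, already sorted by foo.
df <- data.frame(index = factor(c("a", "b", "a", "b", "a")),
                 foo = c(1, 2, 3, 4, 5))

first_two <- top_n_per_group(df, "index", 2)   # drops the third "a" row
all_rows  <- top_n_per_group(df, "index", 10)  # fewer than 10 per group: no padding
```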

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

Matthew Dowle

2010-Sep-21 10:09 UTC


[R] Sorting and subsetting

All the solutions in this thread so far use the lapply(split(...)) paradigm
either directly or indirectly. That paradigm doesn't scale. That's the likely
source of quite a few 'out of memory' errors and performance issues in R.

data.table doesn't do that internally, and its syntax is pretty easy.
> tmp <- data.table(index = gl(2,20), foo = rnorm(40))
> tmp[, .SD[head(order(-foo),5)], by=index]
      index index.1       foo
 [1,]     1       1 1.9677303
 [2,]     1       1 1.2731872
 [3,]     1       1 1.1100931
 [4,]     1       1 0.8194719
 [5,]     1       1 0.6674880
 [6,]     2       2 1.2236383
 [7,]     2       2 0.9606766
 [8,]     2       2 0.8654497
 [9,]     2       2 0.5404112
[10,]     2       2 0.3373457
As you can see it currently repeats the group column which is a
shame (on the to do list to fix).
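For readers on a more recent data.table release, a hedged sketch of the same query follows. It assumes the data.table package is installed and, to my understanding of current versions, `.SD` no longer includes the grouping column, so the duplicated `index.1` column should not appear. The seed and the name `top5` are illustrative.

```r
library(data.table)

set.seed(1)  # illustrative seed, not in the original post
tmp <- data.table(index = gl(2, 20), foo = rnorm(40))

# Sort by descending foo first, then take the first 5 rows of each group.
top5 <- tmp[order(-foo), head(.SD, 5), by = index]

# Rows per group in the result
per_group <- top5[, .N, by = index]
```

Groups come back in order of first appearance after the sort, so the group holding the largest `foo` is listed first.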

Matthew

http://datatable.r-forge.r-project.org/


