thr3ads.net - R help - [R] Fastest way to repeatedly subset a data frame? [Apr 2007]

Iestyn Lewis

2007-Apr-20 16:29 UTC

[R] Fastest way to repeatedly subset a data frame?

Hi -

 I have a data frame with a large number of observations (62,000 rows, 
but only 2 columns - a character ID and a result list). 

Sample:

 > my.df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
 > my.df
   id result
1 ID1      1
2 ID2      2
3 ID3      3

I have a list of ID vectors.  This list will have anywhere from 100 to 
1000 members, and each member will have anywhere from 10 to 5000 id entries.

Sample:

 > my.idlist <- list()
 > my.idlist[["List1"]] <- c("ID1", "ID3")
 > my.idlist[["List2"]] <- c("ID2")
 > my.idlist
$List1
[1] "ID1" "ID3"

$List2
[1] "ID2"


I need to subset that data frame by the list of IDs in each vector, to 
end up with vectors that contain just the results for the IDs found in 
each vector in the list.  My current approach is to create new columns 
in the original data frame with the names of the list items, and any 
results that don't match replaced with NA.  Here is what I've done so far:

createSubsets <- function(res, slib) {
    for(i in 1:length(slib)) {
        res[ ,names(slib)[i]] <- replace(res$result,
            which(!is.element(res$id, slib[[i]])), NA)
        return (res)
    }
}

I have 2 problems:

1)  My function only works for the first item in the list:

 > my.df <- createSubsets(my.df, my.idlist)
 > my.df
   id result List1
1 ID1      1     1
2 ID2      2    NA
3 ID3      3     3

In order to get all results, I have to copy the loop out of the function 
and paste it into R directly.

2)  It is very, very slow.  For a dataset of 62,000 rows and 253 list 
entries, it takes probably 5 minutes on a Pentium D.  An implementation 
of this kind of subsetting using hashtables in C# takes a negligible 
amount of time. 
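
The symptom in problem 1 is what one would expect when `return` sits inside the `for` loop: the function exits at the end of the first iteration. Below is a minimal sketch with the `return` moved after the loop. It assumes the ID column is named `id`, as in the sample data frame; `seq_along` and `%in%` are stylistic substitutions, not part of the original post. It addresses only the first problem, not the speed.

```r
createSubsets <- function(res, slib) {
    for (i in seq_along(slib)) {
        # Keep results whose id is in the i-th ID vector; set the rest to NA
        res[[names(slib)[i]]] <- replace(res$result, !(res$id %in% slib[[i]]), NA)
    }
    # Return only after every list element has been processed
    res
}

my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3)
my.idlist <- list(List1 = c("ID1", "ID3"), List2 = c("ID2"))
my.df <- createSubsets(my.df, my.idlist)
```

With the sample data this should add both a `List1` and a `List2` column in a single call.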

I am open to any suggestions about data format, methods, anything. 

Thanks,

Iestyn Lewis
Emory University

Iestyn Lewis

2007-Apr-20 18:03 UTC


Hi Phil -

Sadly, although your syntax is certainly a lot cleaner and more elegant 
than mine, the elapsed time is about the same.  5 minutes may have been 
an exaggeration, but we are looking at a timescale of minutes, where the 
C# hashtable method was under a second. 

I have a feeling the inherent problem in both of our approaches is the 
use of is.element and %in%, both of which operate over vectors.  

Maybe what really needs to happen is that each vector of IDs needs to be 
converted to a list - does anyone know if the R implementation of named 
lists is similar to a hashtable like you'd find in Perl or C# or 
whatever?  I.e., is searching for membership in an R list faster than 
looking for an element in a vector?
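
On the hashtable question: as far as I am aware, looking up an element of a named list by name scans the names in order, whereas an environment created with `new.env(hash = TRUE)` is backed by a hash table. The following is only a sketch of that idea on the sample data, not a benchmarked solution. The names `lookup`, `ids`, and `found` and the extra ID `"ID9"` are made up for illustration.

```r
my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3,
                    stringsAsFactors = FALSE)

# Build the lookup table once: one variable per ID, holding its result
lookup <- new.env(hash = TRUE)
for (k in seq_len(nrow(my.df))) {
    assign(my.df$id[k], my.df$result[k], envir = lookup)
}

# Look up a vector of IDs; IDs that are absent come back as NA
ids <- c("ID1", "ID3", "ID9")
found <- unlist(mget(ids, envir = lookup, ifnotfound = NA))
```

Whether this beats `%in%` in practice would need timing on the real data, since `%in%` is built on `match()`, which does its own hashing internally.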

Thanks,

Iestyn

Phil Spector wrote:
> Iestyn -
>    Don't know if this is the fastest, but I suspect it will be
> quite a bit faster than your current method:
>
> makecol = function(x,df=my.df)replace(df$result,!df$id %in% x,NA)
> result = cbind(my.df,sapply(my.idlist,makecol))
>
>                                        - Phil Spector
>                      Statistical Computing Facility
>                      Department of Statistics
>                      UC Berkeley
>                      spector at stat.berkeley.edu

hadley wickham

2007-Apr-20 18:48 UTC


On 4/20/07, Iestyn Lewis <ilewis at pharm.emory.edu> wrote:
> Hi -
> I am open to any suggestions about data format, methods, anything.
How about:

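df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)

ids <- list()
ids[["List1"]] <- c("ID1", "ID3")
ids[["List2"]] <- c("ID2")

# Use the IDs as row names so that rows can be selected by name
rownames(df) <- df$id
# One data frame per ID vector, containing only the matching rows
lapply(ids, function(id) df[id, ])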

Hadley
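
A related sketch, for the case where only the result vectors are wanted rather than whole data-frame rows, uses `match()` to turn each ID vector into row positions. This is a swapped-in technique, not something proposed in the thread. The data are synthetic: the 62,000 rows and 253 lists come from the original post, while the 500 IDs per list and the names `n`, `subsets` are assumptions for illustration.

```r
set.seed(1)
n <- 62000
df <- data.frame(id = paste0("ID", seq_len(n)), result = seq_len(n),
                 stringsAsFactors = FALSE)

# 253 ID vectors of 500 randomly chosen IDs each (sizes are illustrative)
ids <- replicate(253, sample(df$id, 500), simplify = FALSE)
names(ids) <- paste0("List", seq_along(ids))

# match() gives the row position of each requested ID (NA if absent),
# so indexing df$result by it yields the results in the order requested
subsets <- lapply(ids, function(id) df$result[match(id, df$id)])
```

Wrapping the `lapply` call in `system.time()` is one way to compare this against the other approaches on real data.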
