thr3ads.net - R help - [R] data frame subset too slow [Dec 2010]

Duke

2010-Dec-30 15:23 UTC

[R] data frame subset too slow

Hi all,

First, I don't have much experience with R, so be gentle. I am dealing 
with a dataset of tens of thousands of lines, each with about 10 columns 
of data. I have to create subsets of this data based on certain 
conditions (for example, rows whose first column also appears in another 
dataset). Here is how I did it:

# import data
dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat <- dat[dat[1] %in% list[1],]

The third statement is meant to create a new data frame containing the 
rows of dat whose first column also appears in the first column of list. 
The code runs fine with small test data. When I tried it with my real 
data (~80k lines, ~15 MB), it took forever (a few hours). I don't know 
why it takes that long, but I don't think it should. Even with a for 
loop in C++, I think I could get this done in a few minutes.

Does anyone have any ideas, advice, or suggestions?

Thanks so much in advance and Happy New Year to all of you.

D.

jim holtman

2010-Dec-30 16:13 UTC

You should be using dat[[1]].  Here is an example with 80000 rows that
takes about 0.02 seconds to get the subset.

Please provide an 'str' of what your data looks like.
> n <- 80000  # rows to create
> dat <- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n), runif(n))
> lst <- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n), runif(n))
> str(dat)
'data.frame':   80000 obs. of  5 variables:
 $ sample.1.200..n..TRUE.: int  39 116 69 163 51 125 144 32 28 4 ...
 $ runif.n.              : num  0.519 0.793 0.549 0.77 0.272 ...
 $ runif.n..1            : num  0.691 0.89 0.783 0.467 0.357 ...
 $ runif.n..2            : num  0.705 0.254 0.584 0.998 0.279 ...
 $ runif.n..3            : num  0.873 1 0.678 0.702 0.455 ...
> str(lst)
'data.frame':   80000 obs. of  5 variables:
 $ sample.1.100..n..TRUE.: int  38 83 38 70 77 44 81 55 32 1 ...
 $ runif.n.              : num  0.0621 0.7374 0.074 0.4281 0.0516 ...
 $ runif.n..1            : num  0.879 0.294 0.146 0.884 0.58 ...
 $ runif.n..2            : num  0.648 0.745 0.825 0.507 0.799 ...
 $ runif.n..3            : num  0.2523 0.1679 0.9728 0.0478 0.0967 ...
> system.time({
+ dat.sub <- dat[dat[[1]] %in% lst[[1]],]
+ })
   user  system elapsed
   0.02    0.00    0.01
> str(dat.sub)
'data.frame':   39803 obs. of  5 variables:
 $ sample.1.200..n..TRUE.: int  39 69 51 32 28 4 69 3 48 69 ...
 $ runif.n.              : num  0.5188 0.5494 0.2718 0.5566 0.0893 ...
 $ runif.n..1            : num  0.691 0.783 0.357 0.619 0.717 ...
 $ runif.n..2            : num  0.705 0.584 0.279 0.789 0.192 ...
 $ runif.n..3            : num  0.873 0.678 0.455 0.843 0.383 ...
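
To make the difference between dat[1] and dat[[1]] concrete, here is a small sketch on made-up toy data (the column names id and x and all values are assumptions for illustration, not taken from the original files). Single brackets return a one-column data frame, which is a list of length 1, so %in% yields a single logical value for the whole column; double brackets return the column vector, so %in% yields one logical value per row.

```r
# Toy data: hypothetical values, for illustration only
dat <- data.frame(id = c(1, 2, 3, 4, 5), x = c(10, 20, 30, 40, 50))
lst <- data.frame(id = c(2, 4, 6))

# dat[1] is a one-column data frame (a list of length 1),
# so %in% returns a single logical value, not one per row
whole <- dat[1] %in% lst[1]
length(whole)

# dat[[1]] is the column vector itself,
# so %in% returns one logical value per row of dat
keep <- dat[[1]] %in% lst[[1]]
length(keep)

# Row subset using the per-row logical vector
dat.sub <- dat[keep, ]
dat.sub$id
```

With the single-bracket form the row index is therefore a length-one logical, so the subset does not select rows by matching first-column values at all; it is not only slower but also not the intended test.
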


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
