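You should be using dat[[1]], which extracts the first column as a
vector; dat[1] returns a one-column data frame. Here is an example with
80000 rows that takes about 0.02 seconds to get the subset.
Please also provide the output of 'str' so we can see what your data
looks like.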
> n <- 80000 # rows to create
> dat <- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n),
+     runif(n))
> lst <- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n),
+     runif(n))
> str(dat)
'data.frame': 80000 obs. of 5 variables:
$ sample.1.200..n..TRUE.: int 39 116 69 163 51 125 144 32 28 4 ...
$ runif.n. : num 0.519 0.793 0.549 0.77 0.272 ...
$ runif.n..1 : num 0.691 0.89 0.783 0.467 0.357 ...
$ runif.n..2 : num 0.705 0.254 0.584 0.998 0.279 ...
$ runif.n..3 : num 0.873 1 0.678 0.702 0.455 ...
> str(lst)
'data.frame': 80000 obs. of 5 variables:
$ sample.1.100..n..TRUE.: int 38 83 38 70 77 44 81 55 32 1 ...
$ runif.n. : num 0.0621 0.7374 0.074 0.4281 0.0516 ...
$ runif.n..1 : num 0.879 0.294 0.146 0.884 0.58 ...
$ runif.n..2 : num 0.648 0.745 0.825 0.507 0.799 ...
$ runif.n..3 : num 0.2523 0.1679 0.9728 0.0478 0.0967 ...
> system.time({
+ dat.sub <- dat[dat[[1]] %in% lst[[1]],]
+ })
   user  system elapsed
   0.02    0.00    0.01
> str(dat.sub)
'data.frame': 39803 obs. of 5 variables:
$ sample.1.200..n..TRUE.: int 39 69 51 32 28 4 69 3 48 69 ...
$ runif.n. : num 0.5188 0.5494 0.2718 0.5566 0.0893 ...
$ runif.n..1 : num 0.691 0.783 0.357 0.619 0.717 ...
$ runif.n..2 : num 0.705 0.584 0.279 0.789 0.192 ...
$ runif.n..3 : num 0.873 0.678 0.455 0.843 0.383 ...
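
For what it's worth, here is a minimal sketch of the difference between
the two indexing forms, on made-up data (the column names id and x and
the values are assumptions for illustration, not taken from your files).
With single brackets, %in% is handed a list (the data frame) rather than
the values in the column; with double brackets it gets the vector itself
and returns one TRUE/FALSE per row.

```r
# Small stand-ins for the real data; names and values are hypothetical.
dat <- data.frame(id = c(1L, 2L, 3L, 4L, 5L),
                  x  = c(0.1, 0.2, 0.3, 0.4, 0.5))
lst <- data.frame(id = c(2L, 4L, 9L))

# Single brackets keep the data frame wrapper around the column.
class(dat[1])     # "data.frame"

# Double brackets pull out the column itself, an atomic vector.
class(dat[[1]])   # "integer"

# Elementwise membership test: one logical value per row of dat.
keep <- dat[[1]] %in% lst[[1]]
length(keep) == nrow(dat)   # TRUE

# Use the logical vector as a row index to build the subset.
dat.sub <- dat[keep, ]
dat.sub$id        # 2 4
```

Indexing by column name (dat$id or dat[["id"]]) should behave the same
way as dat[[1]] here, and is easier to read when the columns are named.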
On Thu, Dec 30, 2010 at 10:23 AM, Duke <duke.lists at gmx.com> wrote:
> Hi all,
>
> First, I don't have much experience with R, so be gentle. OK, I am dealing
> with a dataset (~ tens of thousands of lines, each line ~ 10 columns of
> data). I have to create some subsets of this data based on certain
> conditions (for example, same first column as another dataset, etc.).
> Here is how I did it:
>
> # import data
> dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
> list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
> # create sub data
> subdat <- dat[dat[1] %in% list[1],]
>
> So the third line is meant to create a new data frame with the rows whose
> first column appears in both dat and list. The code runs just fine with
> small test data. When I tried it with my real data (~80k lines, ~15MB),
> it takes forever (a few hours). I don't know why it takes that long, but
> I think it shouldn't. I think even with a for loop in C++, I could get
> this done in, say, a few minutes.
>
> So does anyone have any ideas/advice/suggestions?
>
> Thanks so much in advance and Happy New Year to all of you.
>
> D.
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?