Take a look at the 'data.table' package. Here are some timings for the
lookup using data frames, matrices and data.tables: the keyed data.table
lookup returns in under 0.2 seconds elapsed, against 0.4 to 0.5 seconds
for the vector scans.
> str(x.df)
'data.frame':   2500000 obs. of  4 variables:
 $ x  : Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
> system.time(a <- x.df[[1]] %in% "AAAA")
   user  system elapsed
   0.33    0.00    0.39
> x.m <- as.matrix(x.df)
> str(x.m)
 chr [1:2500000, 1:4] "LMDC" "WFXC" "NUBQ" "RMOK" "LZVR" "GLCE" "GAZE" "NIFT" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "x" "x.1" "x.2" "x.3"
> system.time(a <- x.m[,1] %in% "AAAA")
   user  system elapsed
   0.50    0.00    0.51
> require(data.table)
> x.df <- data.table(x.df)
> setkey(x.df, x)
> system.time(a <- x.df["AAAA"])
   user  system elapsed
   0.05    0.03    0.13
> str(a)
Classes 'data.table' and 'data.frame':  7 obs. of  4 variables:
 $ x  : Factor w/ 1 level "AAAA": 1 1 1 1 1 1 1
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 - attr(*, "sorted")= chr "x"
> system.time(x.df["ABCD"])
   user  system elapsed
   0.08    0.02    0.16
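
If you want to try this yourself, here is a minimal sketch of the same comparison on made-up data. The helper make_keys(), the row count, and the use of character columns instead of factors are my assumptions for illustration, not the data used for the timings above; wrap the two lookups in system.time() to compare them on your own machine.

```r
library(data.table)

set.seed(1)
n <- 250000L  # assumption: smaller than the 2.5M rows above so it runs quickly

# Hypothetical helper: random four-letter keys like "LMDC", "WFXC", ...
make_keys <- function(n) {
  paste0(sample(LETTERS, n, TRUE), sample(LETTERS, n, TRUE),
         sample(LETTERS, n, TRUE), sample(LETTERS, n, TRUE))
}

x.df <- data.frame(x = make_keys(n), x.1 = make_keys(n),
                   stringsAsFactors = FALSE)
x.df$x[1:3] <- "AAAA"  # make sure the key we look up is present

# data.frame lookup: scans the whole column on every call
hits.df <- x.df[x.df$x %in% "AAAA", ]

# data.table lookup: setkey() sorts by x once, after which a lookup
# on the key column is a binary search rather than a full scan
x.dt <- as.data.table(x.df)
setkey(x.dt, x)
hits.dt <- x.dt["AAAA"]  # x.dt[.("AAAA")] is the more explicit spelling

# Both approaches should return the same number of matching rows
nrow(hits.df) == nrow(hits.dt)
```

The cost of setkey() is paid once, so this only helps if you do many lookups against the same table, which is what the SQL-like usage described below sounds like.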
On Tue, Nov 22, 2011 at 2:01 PM, TimothyDalbey <tmdalbey at gmail.com> wrote:
> Hey All,
>
> So - I promise to write a blog post on this topic and post it somewhere on
> the internet once I get to the bottom of this. Basically, the set-up to the
> problem is like this:
>
> 1. I have a data frame with dim (2547290, 4)
> 2. I need to make SQL like lookups on the dataframe. I have been using the
> following sort of syntax:
>
> a.dataframe[a.dataframe[[column_index]] %in% some_value, ]
>
> 3. This process takes quite a lot of time (~2 seconds) on m1.small
> instances AMIs (AWS)
>
> So, I hope I can get that look-up/search logic quite a lot faster. I have
> heard that using matrices is the way to do it but I haven't found any
> resources on performing that sort of operation specifically that have
> yielded better results.
>
> Thoughts, feelings and advice are more than welcome.
>
> Best,
> TMD
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Data-Frame-Search-Slow-tp4096906p4096906.html
> Sent from the R help mailing list archive at Nabble.com.
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.