Dear R experts---I am struggling with memory and speed issues. Advice would be appreciated.

I have a long data set (of financial stock returns, with stock name and trading day). All three variables (stock return, id, and day) are irregular. The object is about 1.3GB in object.size (200MB on disk). Now I need to merge the main data set with some aggregate data (e.g., the S&P500 market rate of return, indexed by day) from the same day. This "market data set" is not big (object.size = 300K, 5 columns, 12000 rows).

Let's say my (dumb statistical) plan is to run one grand regression, where the individual rate of return is y and the market rate of return is x. The following should work without a problem:

  combined <- merge( main, aggregate.data, by="day", all.x=TRUE, all.y=FALSE )
  lm( stockreturn ~ marketreturn, data=combined )

Alas, the merge is neither space-efficient nor fast. In fact, I run out of memory on my 16GB Linux machine. My guess is that by whittling it down I could make it work (perhaps doing the merge in chunks and then rbinding the pieces), but this is painful.

In Perl, I would define a hash with the day as key and the market return as value, and then loop over the main data set to supplement it.

Is there a recommended way of doing such tasks in R, either super-fast (so that I can merge many, many times) or space-efficient (so that I merge once and store the results)?

Sincerely,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
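For what it's worth, the Perl-hash idea translates almost directly into base R via match(): build an index into the small market table's day column, then copy the matched values into the big table in place, avoiding merge() entirely. A minimal sketch with made-up toy data (the column names mirror the ones in the message; the real tables are of course far larger):

```r
## Sketch of the "Perl hash" approach in base R. Toy data stand in for
## the real 1.3GB table; column names follow the message above.
main <- data.frame(id          = c("A", "A", "B"),
                   day         = c(1L, 2L, 1L),
                   stockreturn = c(0.01, -0.02, 0.03))
market <- data.frame(day          = 1:2,
                     marketreturn = c(0.005, -0.010))

## match() does one hashed lookup per row of main: for each main$day,
## find the row of market with the same day (NA if absent).
idx <- match(main$day, market$day)

## Copy the matched column into main in place -- no second combined
## data frame is built, so this needs far less memory than merge().
main$marketreturn <- market$marketreturn[idx]

fit <- lm(stockreturn ~ marketreturn, data = main)
```

One merge()-free pass like this leaves main with a marketreturn column aligned by day, which lm() can then use directly.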
I think you are looking for the 'data.table' package.

On 09/10/2011 17:31, ivo welch wrote:
> Dear R experts---I am struggling with memory and speed issues. [...]
--
Patrick Burns
pburns at pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')
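To make that pointer concrete, here is a hedged sketch of how the merge might look with data.table (assuming the package is installed; the table and column names are the ones from Ivo's message, and the data are toy stand-ins). Keyed joins avoid much of the copying that merge() on data frames does:

```r
library(data.table)  # assumes data.table is installed

## Toy stand-ins for the real tables, using the names from the question.
main   <- data.table(id          = c("A", "A", "B"),
                     day         = c(1L, 2L, 1L),
                     stockreturn = c(0.01, -0.02, 0.03))
market <- data.table(day          = 1:2,
                     marketreturn = c(0.005, -0.010))

## Keying sorts each table by day and enables fast binary-search joins.
setkey(main, day)
setkey(market, day)

## market[main] joins on the shared key and keeps every row of main,
## i.e. roughly merge(main, market, by = "day", all.x = TRUE).
combined <- market[main]

fit <- lm(stockreturn ~ marketreturn, data = combined)
```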
Hi Ivo,

On Mon, Oct 10, 2011 at 10:58 AM, ivo welch <ivo.welch at gmail.com> wrote:
> hi steve---agreed...but is there any other computer language in which
> an expression in a [ . ] is anything except a tensor index selector?

Sure, it's a type specifier in Scala generics:
http://www.scala-lang.org/node/113
Something similar to "scale-eez" in Haskell. Also, in MATLAB (ugh) it's
not even a tensor selector (they use "normal" parens there). But I'm not
sure what that has to do w/ the price of tea in china.

With data.table, "[" is still "tensor-selector"-like, though. You can
pass another data.table in through the `i` argument to use as the "keys"
for your selection (like selecting rows), which I guess will be your
most common use case if you're moving to data.table (presumably you are
trying to take advantage of its quickness over big-table-like objects).
You can use the `j` param to further manipulate columns. If you pass a
data.table in as `i`, its columns become available in `j`.

I'll grant you that this is different from your standard "rectangular
object" selection in R, but the motivation isn't so strange: the i and j
params in normal calls to 'xxx[i,j]' are for selecting (ok, not
manipulating) the rows and columns of other "rectangular" objects, too.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
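A small sketch may make the `i`/`j` description above concrete (again assuming data.table is installed; the toy tables and the excess-return column are made up for illustration). Passing a data.table as `i` joins on the keys, and `j` can then compute with columns from both tables:

```r
library(data.table)  # assumes data.table is installed

## Keyed toy tables; key = "day" plays the role of setkey().
DT  <- data.table(day         = c(1L, 1L, 2L),
                  stockreturn = c(0.01, 0.03, -0.02),
                  key = "day")
MKT <- data.table(day          = 1:2,
                  marketreturn = c(0.005, -0.010),
                  key = "day")

## i = another data.table -> join DT's rows to MKT's on the key;
## j = an expression over the joined columns, which here can see
## stockreturn (from DT) and marketreturn (from MKT) by name.
excess <- DT[MKT, .(day, excess = stockreturn - marketreturn)]
```

The join keeps both of DT's day-1 rows against MKT's single day-1 row, so no separate merged copy of the big table has to be materialized before computing on the columns.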