Richard Vlasimsky
2010-Nov-24 11:51 UTC
[R] Performance tuning tips when working with wide datasets
Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?

In particular, I am trying to perform a merge like the one below:

merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);

This statement takes about 8 hours to execute on a pretty fast machine. The dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25 columns. The dataset data2 contains annual data (only 60 observations), but it has a lot of columns (20,000 of them).

I have to do a lot of these kinds of merges, so I need to figure out a way to speed them up.

I have tried a number of different things to no avail. I've noticed that rbind() executes much faster on matrices than on data frames. However, the performance improvement from using matrices (vs. data frames) for merges was negligible (8 hours down to 7). I tried casting my merge field (date) to various data types (character, factor, Date); this didn't seem to have any effect. I tried the hash package, but merge couldn't coerce that class into a data.frame. I've tried various ways to parallelize the computation in the past and found that problematic for a variety of reasons (runaway forked processes, doesn't run in a GUI environment, doesn't run on Macs, etc.).

I'm starting to run out of ideas, anyone? Merging a 60-row dataset shouldn't take that long.

Thanks,
Richard
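For concreteness, here is a minimal synthetic sketch of the shapes described above (the contents, the choice of a 1st-of-January key, and the exact row counts are assumptions for illustration, not the real data):

    ## daily data since 1950: ~22,000 rows, 25 value columns
    dates <- seq(as.Date("1950-01-01"), as.Date("2010-11-24"), by = "day")
    data1 <- data.frame(date = dates, matrix(rnorm(length(dates) * 25), ncol = 25))

    ## annual data: 60 rows keyed by the 1st of January, 20,000 value columns
    years <- seq(as.Date("1950-01-01"), by = "year", length.out = 60)
    data2 <- data.frame(date = years, matrix(rnorm(60 * 20000), ncol = 20000))

    ## the slow statement in question (left commented; this is the 8-hour call)
    ## merged_data <- merge(data1, data2, by = "date", all = TRUE, sort = TRUE)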
Claudia Beleites
2010-Nov-24 12:23 UTC
[R] Performance tuning tips when working with wide datasets
Dear Richard,

> Does anyone have any performance tuning tips when working with datasets that
> are extremely wide (e.g. 20,000 columns)?

The obvious one is: use matrices, and take care that they don't get converted back to data.frames.

> [...]
> I have tried a number of different things to speed things up to no avail.
> I've noticed that rbinds execute much faster using matrices than dataframes.
> However the performance improvement when using matrices (vs. data frames) on
> merges were negligible (8 hours down to 7).

which is astonishing, as merge(matrix) uses merge.default, which boils down to merge(as.data.frame(x), as.data.frame(y), ...).

> [...]
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't
> take that long.

Do I understand correctly that the result should be a 20000 x 20025 matrix, where the additional 20,000 columns come from data2 and their values end up only in the rows of e.g. every 1st of January?

In that case, you may be much faster producing tmp <- matrix(NA, 20000, 20000), filling the values of data2 into the correct rows, and then cbind()-ing data1 and tmp.

Make sure you have enough RAM available: tmp starts out at about 1.5 GB as a logical NA matrix and grows to roughly 3 GB once filled with doubles. If you manage to do this without swapping, it should be reasonably fast.

If you end up writing a proper merge function for matrices, please let me know: I'd be interested in using it...

Claudia

--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites at units.it
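A rough sketch of that fill-and-cbind idea, under the assumption that data1 is a data.frame whose "date" column holds daily dates and data2 is (coercible to) a 60 x 20,000 numeric matrix with one row per year; the variable names and year labels are illustrative:

    dates       <- as.Date(data1$date)
    m1          <- as.matrix(data1[ , setdiff(names(data1), "date")])  # numeric part of data1
    data2_mat   <- as.matrix(data2)           # 60 x 20,000, one row per year
    data2_years <- 1950:2009                  # assumed year labels for data2's rows

    ## preallocate as numeric NA (NA_real_) to avoid a later logical -> double coercion
    tmp <- matrix(NA_real_, nrow = nrow(m1), ncol = ncol(data2_mat))

    ## place each annual row of data2 on the 1st of January of its year
    jan1       <- which(format(dates, "%m-%d") == "01-01")
    jan1_years <- as.integer(format(dates[jan1], "%Y"))
    rows       <- jan1[match(data2_years, jan1_years)]
    ok         <- !is.na(rows)
    tmp[rows[ok], ] <- data2_mat[ok, ]

    merged <- cbind(m1, tmp)                  # roughly 20,000 x 20,025 numeric matrix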
Prof Brian Ripley
2010-Nov-24 12:43 UTC
[R] Performance tuning tips when working with wide datasets
On Wed, 24 Nov 2010, Richard Vlasimsky wrote:

> Does anyone have any performance tuning tips when working with
> datasets that are extremely wide (e.g. 20,000 columns)?

Don't use data frames.

> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> This statement takes about 8 hours to execute on a pretty fast
> machine. The dataset data1 contains daily data going back to 1950
> (20,000 rows) and has 25 columns. The dataset data2 contains annual
> data (only 60 observations), however there are lots of columns
> (20,000 of them).
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset
> shouldn't take that long.

Correct, but that is not what you said you are doing.

Why do you need to do the merge at all? It sounds like you simply need to create a suitable index into the rows of the 20,000-column data frame and use that appropriately, e.g. merge data1 with a data2 containing the year and a column index=1:60.

date <- seq(from=as.Date("1950-01-01"), to=Sys.Date(), by=1)
year <- 1900 + as.POSIXlt(date)$year
data1 <- data.frame(date, year, matrix(rnorm(22243*25), , 25))
fdata2 <- data.frame(year=1950:2010, index=1:61)

> system.time(m <- merge(data1, fdata2, by='year', all=TRUE, sort=TRUE))
   user  system elapsed
  0.226   0.011   0.237

and use m$index to index the real data2.

We don't have an example, but at the least you are ending up with a data frame with 20,000 rows and ca 20,000 cols. That's not a small object.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
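Following that suggestion, the annual values for any given day can then be looked up by row index, without ever building the full 20,000 x 20,025 object (a small illustrative sketch; m and data2 as above, and the chosen date is arbitrary):

    ## row of the real data2 that belongs to a particular day
    i <- m$index[m$date == as.Date("1987-06-15")]
    annual_row <- data2[i, ]      # one 20,000-column row, fetched on demand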
Gabor Grothendieck
2010-Nov-24 12:54 UTC
[R] Performance tuning tips when working with wide datasets
On Wed, Nov 24, 2010 at 6:51 AM, Richard Vlasimsky <richard.vlasimsky at imidex.com> wrote:

> Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't take that long.

I don't know for sure whether it's faster, but you could try using sqldf, since it won't rely on R facilities to do the join. See the examples here:

http://code.google.com/p/sqldf/#Example_4._Join

If that is still too slow, try adding indexes as illustrated in examples 4h and 4i there. It's very easy to try, since it's just one SQL statement to do the join and one to index each table if you need that.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
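For what it's worth, a minimal sketch of the sqldf call; the SQL itself is an assumption, and the indexed variants are in the linked examples 4h/4i. Note also that stock SQLite limits tables to 2000 columns by default, so the very wide table may need to be kept in a narrower form:

    library(sqldf)

    ## sqldf copies data1 and data2 into a temporary SQLite database and
    ## runs the join there, bypassing R's merge()
    merged_data <- sqldf("SELECT * FROM data1 LEFT JOIN data2 USING (date)")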
Matthew Dowle
2010-Nov-24 16:15 UTC
[R] Performance tuning tips when working with wide datasets
Richard,

Try data.table. See the introduction vignette and the presentations; e.g. there is a slide showing a join to 183,000,000 observations of daily stock prices in 0.002 seconds. data.table has fast rolling joins (i.e. fast last observation carried forward) too; I see you asked about that on this list on 8 Nov. Also see the fast aggregations using 'by' on a key()-ed in-memory table.

I wonder if your 20,000 columns are always populated for all rows. If not, then consider collapsing to a 3-column table (row, col, data) and then joining to that. You may have that format in your original data source anyway, so you may be able to skip a step you may have implemented already which expands that format to wide. In other words, keeping it narrow may be an option (like how a sparse matrix is stored).

Matthew

http://datatable.r-forge.r-project.org/

"Richard Vlasimsky" <richard.vlasimsky at imidex.com> wrote in message news:2E042129-4430-4C66-9308-A36B761EBBEB at imidex.com...

> Does anyone have any performance tuning tips when working with datasets
> that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2,
>                      by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset
> shouldn't take that long.
>
> Thanks,
> Richard
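A rough data.table sketch of both ideas, assuming data1/data2 as described in the original post and a common 'date' key; the syntax follows current data.table, and roll = TRUE gives last observation carried forward:

    library(data.table)

    dt1 <- as.data.table(data1); setkey(dt1, date)
    dt2 <- as.data.table(data2); setkey(dt2, date)

    ## keyed merge, keeping every daily row of data1
    merged <- merge(dt1, dt2, by = "date", all.x = TRUE)

    ## rolling join: each daily row picks up the most recent annual row of dt2
    rolled <- dt2[dt1, roll = TRUE]

    ## narrow (row, col, value) alternative if data2 is sparsely populated
    long2 <- melt(dt2, id.vars = "date", variable.name = "col", value.name = "value")
    setkey(long2, date)          # then join long2 to dt1 by date as needed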
Andreas Borg
2010-Nov-29 11:50 UTC
[R] Performance tuning tips when working with wide datasets
Richard Vlasimsky schrieb:

> Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't take that long.

Hi Richard,

I had similar problems (even with much less data) and found out that most of the running time was caused by memory swapping rather than CPU usage. If you do not need all of the merged data at once, block-wise processing can help: you generate only as much merged data at a time as fits into main memory. I ended up using package RSQLite (an embedded database) in the following way:

- create a database connection (explained in the package docs)
- copy the data to database tables via dbWriteTable()
- create indices on the columns used for merging, something like
  dbGetQuery(con, "create index index_year on table2(year)")
  -> this speeds up joining significantly
- construct an SQL query to do the join / merge operation and send it to SQLite via dbSendQuery()
- retrieve the result in blocks of reasonable size with fetch()

Unless there is an operation in the query which requires SQLite to process the whole result (e.g. sorting), the result rows will be created on the fly for every call of fetch(), instead of a huge table being allocated in addition to the original data.

I am not sure whether this works with other database engines (there are a couple of database interfaces on CRAN); when I tried RPostgreSQL, it created the whole result set at once, leading to the same memory problem. Maybe that behaviour can be changed by some config variable.

Best regards,

Andreas

--
Andreas Borg
Medizinische Informatik

UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität
Institut für Medizinische Biometrie, Epidemiologie und Informatik
Obere Zahlbacher Straße 69, 55131 Mainz
www.imbei.uni-mainz.de

Telefon +49 (0) 6131 175062
E-Mail: borg at imbei.uni-mainz.de
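A compact sketch of those steps with RSQLite; the table and column names are placeholders, and fetch() is the older DBI spelling (dbFetch() in current DBI):

    library(RSQLite)

    con <- dbConnect(SQLite(), dbname = ":memory:")   # or a file path for data that won't fit in RAM

    dbWriteTable(con, "data1", data1)
    dbWriteTable(con, "data2", data2)

    ## index the join column so SQLite doesn't scan the tables for every match
    dbGetQuery(con, "CREATE INDEX idx_data1_year ON data1(year)")
    dbGetQuery(con, "CREATE INDEX idx_data2_year ON data2(year)")

    ## send the join and pull the result back block-wise
    res <- dbSendQuery(con, "SELECT * FROM data1 LEFT JOIN data2 USING (year)")
    while (!dbHasCompleted(res)) {
      block <- fetch(res, n = 1000)    # dbFetch(res, n = 1000) in newer DBI
      ## ... process 'block' here ...
    }
    dbClearResult(res)
    dbDisconnect(con)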