Richard Vlasimsky
2010-Nov-24 11:51 UTC
[R] Performance tuning tips when working with wide datasets
Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?

In particular, I am trying to perform a merge like the one below:

merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);

This statement takes about 8 hours to execute on a pretty fast machine. The dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25 columns. The dataset data2 contains annual data (only 60 observations), but it has a lot of columns (20,000 of them).

I have to do a lot of these kinds of merges, so I need to figure out a way to speed them up.

I have tried a number of different things to no avail. I've noticed that rbind() executes much faster on matrices than on data frames. However, the performance improvement from using matrices (vs. data frames) for merges was negligible (8 hours down to 7). I tried casting my merge field (date) to various data types (character, factor, Date); this didn't seem to have any effect. I tried the hash package, but merge couldn't coerce that class into a data.frame. I've tried various ways to parallelize the computation in the past and found that problematic for a variety of reasons (runaway forked processes, doesn't run in a GUI environment, doesn't run on Macs, etc.).

I'm starting to run out of ideas, anyone? Merging a 60-row dataset shouldn't take that long.

Thanks,
Richard
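For concreteness, here is a minimal synthetic sketch of the shapes described above (the contents, the choice of a 1st-of-January key, and the exact row counts are assumptions for illustration, not the real data):

    ## daily data since 1950: ~22,000 rows, 25 value columns
    dates <- seq(as.Date("1950-01-01"), as.Date("2010-11-24"), by = "day")
    data1 <- data.frame(date = dates, matrix(rnorm(length(dates) * 25), ncol = 25))

    ## annual data: 60 rows keyed by the 1st of January, 20,000 value columns
    years <- seq(as.Date("1950-01-01"), by = "year", length.out = 60)
    data2 <- data.frame(date = years, matrix(rnorm(60 * 20000), ncol = 20000))

    ## the slow statement in question (left commented; this is the 8-hour call)
    ## merged_data <- merge(data1, data2, by = "date", all = TRUE, sort = TRUE)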
Claudia Beleites
2010-Nov-24 12:23 UTC
[R] Performance tuning tips when working with wide datasets
Dear Richard,

> Does anyone have any performance tuning tips when working with datasets that
> are extremely wide (e.g. 20,000 columns)?

The obvious one is: use matrices, and take care that they don't get converted back to data.frames.

> [...]
> I have tried a number of different things to speed things up to no avail.
> I've noticed that rbinds execute much faster using matrices than dataframes.
> However the performance improvement when using matrices (vs. data frames) on
> merges were negligible (8 hours down to 7).

which is astonishing, as merge(matrix) uses merge.default, which boils down to merge(as.data.frame(x), as.data.frame(y), ...).

> [...]
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't
> take that long.

Do I understand correctly that the result should be a 20000 x 20025 matrix, where the additional 20,000 columns come from data2 and their values end up only in the rows of e.g. every 1st of January?

In that case, you may be much faster producing tmp <- matrix(NA, 20000, 20000), filling the values of data2 into the correct rows, and then cbind()-ing data1 and tmp.

Make sure you have enough RAM available: tmp starts out at about 1.5 GB as a logical NA matrix and grows to roughly 3 GB once filled with doubles. If you manage to do this without swapping, it should be reasonably fast.

If you end up writing a proper merge function for matrices, please let me know: I'd be interested in using it...

Claudia

--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites at units.it
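A rough sketch of that fill-and-cbind idea, under the assumption that data1 is a data.frame whose "date" column holds daily dates and data2 is (coercible to) a 60 x 20,000 numeric matrix with one row per year; the variable names and year labels are illustrative:

    dates       <- as.Date(data1$date)
    m1          <- as.matrix(data1[ , setdiff(names(data1), "date")])  # numeric part of data1
    data2_mat   <- as.matrix(data2)           # 60 x 20,000, one row per year
    data2_years <- 1950:2009                  # assumed year labels for data2's rows

    ## preallocate as numeric NA (NA_real_) to avoid a later logical -> double coercion
    tmp <- matrix(NA_real_, nrow = nrow(m1), ncol = ncol(data2_mat))

    ## place each annual row of data2 on the 1st of January of its year
    jan1       <- which(format(dates, "%m-%d") == "01-01")
    jan1_years <- as.integer(format(dates[jan1], "%Y"))
    rows       <- jan1[match(data2_years, jan1_years)]
    ok         <- !is.na(rows)
    tmp[rows[ok], ] <- data2_mat[ok, ]

    merged <- cbind(m1, tmp)                  # roughly 20,000 x 20,025 numeric matrix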
Prof Brian Ripley
2010-Nov-24 12:43 UTC
[R] Performance tuning tips when working with wide datasets
On Wed, 24 Nov 2010, Richard Vlasimsky wrote:

> Does anyone have any performance tuning tips when working with
> datasets that are extremely wide (e.g. 20,000 columns)?

Don't use data frames.

> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> This statement takes about 8 hours to execute on a pretty fast
> machine. The dataset data1 contains daily data going back to 1950
> (20,000 rows) and has 25 columns. The dataset data2 contains annual
> data (only 60 observations), however there are lots of columns
> (20,000 of them).
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset
> shouldn't take that long.

Correct, but that is not what you said you are doing.

Why do you need to do the merge at all? It sounds like you simply need to create a suitable index into the rows of the 20,000-column data frame and use that appropriately, e.g. merge data1 with a data2 containing the year and a column index=1:60.

date <- seq(from=as.Date("1950-01-01"), to=Sys.Date(), by=1)
year <- 1900 + as.POSIXlt(date)$year
data1 <- data.frame(date, year, matrix(rnorm(22243*25), , 25))
fdata2 <- data.frame(year=1950:2010, index=1:61)

> system.time(m <- merge(data1, fdata2, by='year', all=TRUE, sort=TRUE))
   user  system elapsed
  0.226   0.011   0.237

and use m$index to index the real data2.

We don't have an example, but at the least you are ending up with a data frame with 20,000 rows and ca 20,000 cols. That's not a small object.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
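Following that suggestion, the annual values for any given day can then be looked up by row index, without ever building the full 20,000 x 20,025 object (a small illustrative sketch; m and data2 as above, and the chosen date is arbitrary):

    ## row of the real data2 that belongs to a particular day
    i <- m$index[m$date == as.Date("1987-06-15")]
    annual_row <- data2[i, ]      # one 20,000-column row, fetched on demand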
Gabor Grothendieck
2010-Nov-24 12:54 UTC
[R] Performance tuning tips when working with wide datasets
On Wed, Nov 24, 2010 at 6:51 AM, Richard Vlasimsky <richard.vlasimsky at imidex.com> wrote:

> Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't take that long.

I don't know for sure whether it's faster, but you could try using sqldf, since it won't rely on R facilities to do the join. See the examples here:

http://code.google.com/p/sqldf/#Example_4._Join

If that is still too slow, try adding indexes as illustrated in examples 4h and 4i there. It's very easy to try, since it's just one SQL statement to do the join and one to index each table if you need that.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
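For what it's worth, a minimal sketch of the sqldf call; the SQL itself is an assumption, and the indexed variants are in the linked examples 4h/4i. Note also that stock SQLite limits tables to 2000 columns by default, so the very wide table may need to be kept in a narrower form:

    library(sqldf)

    ## sqldf copies data1 and data2 into a temporary SQLite database and
    ## runs the join there, bypassing R's merge()
    merged_data <- sqldf("SELECT * FROM data1 LEFT JOIN data2 USING (date)")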
Matthew Dowle
2010-Nov-24 16:15 UTC
[R] Performance tuning tips when working with wide datasets
Richard,

Try data.table. See the introduction vignette and the presentations; e.g. there is a slide showing a join to 183,000,000 observations of daily stock prices in 0.002 seconds. data.table has fast rolling joins (i.e. fast last observation carried forward) too; I see you asked about that on this list on 8 Nov. Also see the fast aggregations using 'by' on a key()-ed in-memory table.

I wonder if your 20,000 columns are always populated for all rows. If not, then consider collapsing to a 3-column table (row, col, data) and then joining to that. You may have that format in your original data source anyway, so you may be able to skip a step you may have implemented already which expands that format to wide. In other words, keeping it narrow may be an option (like how a sparse matrix is stored).

Matthew

http://datatable.r-forge.r-project.org/

"Richard Vlasimsky" <richard.vlasimsky at imidex.com> wrote in message news:2E042129-4430-4C66-9308-A36B761EBBEB at imidex.com...

> Does anyone have any performance tuning tips when working with datasets
> that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2,
>                      by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset
> shouldn't take that long.
>
> Thanks,
> Richard
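A rough data.table sketch of both ideas, assuming data1/data2 as described in the original post and a common 'date' key; the syntax follows current data.table, and roll = TRUE gives last observation carried forward:

    library(data.table)

    dt1 <- as.data.table(data1); setkey(dt1, date)
    dt2 <- as.data.table(data2); setkey(dt2, date)

    ## keyed merge, keeping every daily row of data1
    merged <- merge(dt1, dt2, by = "date", all.x = TRUE)

    ## rolling join: each daily row picks up the most recent annual row of dt2
    rolled <- dt2[dt1, roll = TRUE]

    ## narrow (row, col, value) alternative if data2 is sparsely populated
    long2 <- melt(dt2, id.vars = "date", variable.name = "col", value.name = "value")
    setkey(long2, date)          # then join long2 to dt1 by date as needed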
Andreas Borg
2010-Nov-29 11:50 UTC
[R] Performance tuning tips when working with wide datasets
Richard Vlasimsky schrieb:

> Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);
>
> [...]
>
> I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't take that long.

Hi Richard,

I had similar problems (even with much less data) and found out that most of the running time was caused by memory swapping rather than CPU usage. If you do not need all of the merged data at once, block-wise processing can help: you generate only as much merged data at a time as fits into main memory. I ended up using package RSQLite (an embedded database) in the following way:

- create a database connection (explained in the package docs)
- copy the data to database tables via dbWriteTable()
- create indices on the columns used for merging, something like
  dbGetQuery(con, "create index index_year on table2(year)")
  -> this speeds up joining significantly
- construct an SQL query to do the join / merge operation and send it to SQLite via dbSendQuery()
- retrieve the result in blocks of reasonable size with fetch()

Unless there is an operation in the query which requires SQLite to process the whole result (e.g. sorting), the result rows will be created on the fly for every call of fetch(), instead of a huge table being allocated in addition to the original data.

I am not sure whether this works with other database engines (there are a couple of database interfaces on CRAN); when I tried RPostgreSQL, it created the whole result set at once, leading to the same memory problem. Maybe that behaviour can be changed by some config variable.

Best regards,

Andreas

--
Andreas Borg
Medizinische Informatik

UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität
Institut für Medizinische Biometrie, Epidemiologie und Informatik
Obere Zahlbacher Straße 69, 55131 Mainz
www.imbei.uni-mainz.de

Telefon +49 (0) 6131 175062
E-Mail: borg at imbei.uni-mainz.de
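A compact sketch of those steps with RSQLite; the table and column names are placeholders, and fetch() is the older DBI spelling (dbFetch() in current DBI):

    library(RSQLite)

    con <- dbConnect(SQLite(), dbname = ":memory:")   # or a file path for data that won't fit in RAM

    dbWriteTable(con, "data1", data1)
    dbWriteTable(con, "data2", data2)

    ## index the join column so SQLite doesn't scan the tables for every match
    dbGetQuery(con, "CREATE INDEX idx_data1_year ON data1(year)")
    dbGetQuery(con, "CREATE INDEX idx_data2_year ON data2(year)")

    ## send the join and pull the result back block-wise
    res <- dbSendQuery(con, "SELECT * FROM data1 LEFT JOIN data2 USING (year)")
    while (!dbHasCompleted(res)) {
      block <- fetch(res, n = 1000)    # dbFetch(res, n = 1000) in newer DBI
      ## ... process 'block' here ...
    }
    dbClearResult(res)
    dbDisconnect(con)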