Dear R Users,

I use a combination of MySQL and GNU R for my data crunching. I have to handle medium-to-large data sets stored in a MySQL database; R executes an SQL command to fetch the data and does the plotting with the built-in R plotting functions.

The (low-level) calculations like summing, dividing, grouping, sorting etc. can be done either with the SQL command on the MySQL side or on the R side. My question is: which is faster for these low-level calculations / data rearrangements, MySQL or R? Is there a general rule of thumb for what to shift to the MySQL side and what to the R side?

Thanks
Henri
On Fri, Apr 1, 2011 at 1:46 PM, Henri Mone <henriMone at gmail.com> wrote:
> My question is: which is faster for these low-level calculations /
> data rearrangements, MySQL or R? Is there a general rule of thumb for
> what to shift to the MySQL side and what to the R side?

I would assume an RDBMS has advanced memory management capabilities and is designed for manipulating and handling large amounts of data; these are primary features of most database management server software. So the database server should, in most cases, be used to store and manipulate the data, returning only the processed, qualifying records to the client or other application for further specialized processing and/or data visualization.

Allan.
On Fri, 1 Apr 2011, Henri Mone wrote:
> My question is: which is faster for these low-level calculations /
> data rearrangements, MySQL or R? Is there a general rule of thumb for
> what to shift to the MySQL side and what to the R side?

The data transfer costs almost always dominate here: since such low-level computations are almost always a trivial part of the total cost, you should do whatever reduces the size of the result (e.g. summarization) in the DBMS.

I do wonder what you think the R-sig-DB list is for, if not questions such as this one. Please subscribe and use it next time.

> Thanks
> Henri

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
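[To illustrate the point about transfer costs, here is a minimal sketch of the two approaches using DBI/RMySQL. The connection details and the table/column names (`mytable`, `grp`, `x`) are hypothetical placeholders, not from the original posts.]

```r
library(DBI)
library(RMySQL)

## Hypothetical connection -- substitute your own credentials.
con <- dbConnect(MySQL(), dbname = "mydb", user = "me", password = "secret")

## Approach 1: pull every row over the wire, then aggregate in R.
raw   <- dbGetQuery(con, "SELECT grp, x FROM mytable")
in_r  <- aggregate(x ~ grp, data = raw, FUN = sum)

## Approach 2: aggregate in the DBMS, transfer only one row per group.
in_db <- dbGetQuery(con,
                    "SELECT grp, SUM(x) AS x FROM mytable GROUP BY grp")

dbDisconnect(con)
```

With millions of rows, approach 2 moves only a handful of summary rows out of MySQL, which is exactly the size reduction being recommended.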
On Fri, Apr 1, 2011 at 6:46 AM, Henri Mone <henriMone at gmail.com> wrote:
> My question is: which is faster for these low-level calculations /
> data rearrangements, MySQL or R? Is there a general rule of thumb for
> what to shift to the MySQL side and what to the R side?

The sqldf package makes it easy to use SQLite, H2 or PostgreSQL from R to carry out data manipulation tasks, and this has facilitated some benchmarks by users, exploiting SQLite's ability to run as an "in memory" database. In the cases cited on the sqldf home page, SQLite was faster than R despite the overhead of moving the data into the database and out again. See http://sqldf.googlecode.com

In general the answer depends on the database, what has been cached, the particular query, the size of the data, and how well you have optimized your SQL and your R code. There are entire books on optimizing MySQL, so this is an extensive subject. Comparisons of different approaches can easily come out in a different order, from fastest to slowest, based on what appear to be relatively minor aspects of the problem, so you would have to benchmark the important queries to get an answer that really pertains to your situation. Check out the rbenchmark package for this, which makes it relatively simple to do.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
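[A short sketch of the sqldf + rbenchmark combination suggested above. The data frame `DF` and its columns are made up for illustration; sqldf runs the SQL against an in-memory SQLite database by default.]

```r
library(sqldf)       # run SQL (SQLite by default) directly on data frames
library(rbenchmark)

set.seed(1)
DF <- data.frame(g = sample(letters, 1e6, replace = TRUE),
                 x = runif(1e6))

## Compare grouped sums: SQLite via sqldf vs. two base-R idioms.
benchmark(
  sqlite = sqldf("SELECT g, SUM(x) AS s FROM DF GROUP BY g"),
  r_agg  = aggregate(x ~ g, data = DF, FUN = sum),
  r_tap  = tapply(DF$x, DF$g, sum),
  replications = 5,
  columns = c("test", "replications", "elapsed", "relative")
)
```

The `relative` column makes it easy to see which approach wins for your particular data sizes and queries.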
On Fri, Apr 1, 2011 at 11:46 AM, Henri Mone <henriMone at gmail.com> wrote:
> My question is: which is faster for these low-level calculations /
> data rearrangements, MySQL or R? Is there a general rule of thumb for
> what to shift to the MySQL side and what to the R side?

Given that you are already set up to test this yourself, why don't you? SELECT everything from a table and sum it in R, then SELECT SUM(everything) from the table, and compare the times (obviously your real example might be more complex). Post some benchmark results together with your hardware spec, probably best to the db-flavour R mailing list. Is the MySQL server running locally, i.e. on the same machine? Maybe PostgreSQL will be even faster?

So many of these questions are problem-specific and hardware-related. You can get massive speedups by having more RAM, or more disk, or by spreading your giant database across multiple servers. Rules of thumb are rare in this world, since everyone's thumbs are different sizes and are being stuck into different-sized problems.

Barry
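[The "time it both ways" suggestion above can be sketched with `system.time()`. As before, the connection and the table/column names (`mytable`, `x`) are hypothetical; the `stopifnot()` check confirms both routes give the same answer before the timings are compared.]

```r
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb")   # assumed local server

## Route 1: pull the raw column across the connection, sum in R.
t_r <- system.time({
  x  <- dbGetQuery(con, "SELECT x FROM mytable")
  s1 <- sum(x$x)
})

## Route 2: sum on the MySQL side, pull back a single number.
t_db <- system.time({
  s2 <- dbGetQuery(con, "SELECT SUM(x) AS s FROM mytable")$s
})

dbDisconnect(con)

stopifnot(all.equal(s1, s2))                  # same result both ways
rbind(in_R = t_r, in_MySQL = t_db)            # compare elapsed times
```

Run this against your actual tables and hardware; as noted above, the answer depends heavily on where the server lives and how big the transferred result is.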