Hi R list,

I'm new to R, so I'd like to ask about its capabilities. What I'm looking to do is run some statistical tests on quite big tables, which are aggregated quotes from a market feed.

This is a typical set of data. Each day contains millions of records (up to 10 million unfiltered).

2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0

I'll need to filter it first based on some criteria. Since I keep it in a MySQL database, this can be done with a query. It's not super efficient; I checked already.

Then I need to aggregate the dataset into different time frames (time is represented in ms from midnight, like 35482391). Again, this can be done with a database query; I'm not sure which will be faster. The aggregated tables are going to be much smaller, like thousands of rows per observation day.

Then I need to calculate basic statistics: mean, standard deviation, sums, etc. After the stats are calculated, I need to perform some statistical hypothesis tests.

So, my question is: which tool is faster for data aggregation and filtering on big datasets: MySQL or R?

Thanks,
--Roman N.
In cases where I have to parse through large datasets that will not fit into R's memory, I grab the relevant data using SQL and then analyze it in R. There are several packages designed for this, like [1] and [2] below, that let you query a database using SQL and end up with the result in an R data.frame.

[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
> [...]
> So, my question is: which tool is faster for data aggregation and filtering
> on big datasets: MySQL or R?

-- 
==============================================
Jon Daily
Technician
==============================================
#!/usr/bin/env outside
# It's great, trust me.
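
For anyone reading the archive, the "query with SQL, analyze in R" workflow described above might look roughly like the sketch below. It uses an in-memory SQLite database through DBI and RSQLite so that it runs without a server; the table name, column names, and sample rows are made up for illustration and are not the original poster's schema. With MySQL only the connection line should need to change.

```r
# Sketch only: filter and aggregate inside the database, then pull the
# small result into an R data.frame. Assumes the DBI and RSQLite packages
# are installed. Table and column names (quotes, symbol, side, size,
# price, ms) are hypothetical.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

quotes <- data.frame(
  symbol = "DELL",
  side   = c("Bid", "Bid", "Bid", "Ask"),
  size   = c(400, 300, 135, 200),
  price  = c(15.48, 15.48, 15.48, 15.49),
  ms     = c(35482391, 35482391, 35482391, 35490000),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "quotes", quotes)

# Keep only bids and bucket them by minute (60000 ms) in SQL.
# For MySQL the connection would instead be opened with something like
# dbConnect(RMySQL::MySQL(), dbname = ..., user = ..., password = ...).
per_minute <- dbGetQuery(con, "
  SELECT symbol,
         CAST(ms / 60000 AS INTEGER) AS minute,
         COUNT(*)   AS n,
         SUM(size)  AS total_size,
         AVG(price) AS mean_price
  FROM quotes
  WHERE side = 'Bid'
  GROUP BY symbol, CAST(ms / 60000 AS INTEGER)
")

dbDisconnect(con)
print(per_minute)
```

The point of doing the WHERE and GROUP BY in the database is that only the aggregated rows, not the millions of raw records, have to fit in R's memory.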
Hi,

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
> [...]
> So, my question is: which tool is faster for data aggregation and filtering
> on big datasets: MySQL or R?

Why not try a few experiments and see for yourself -- I guess the answer will depend on what exactly you are doing.

If your datasets are *really* huge, check out some packages listed under the "Large memory and out-of-memory data" section of the "HighPerformanceComputing" task view at CRAN:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Also, if you find yourself needing to do lots of "grouping/summarizing" type of calculations over large data frame-like objects, you might want to check out the data.table package:

http://cran.r-project.org/web/packages/data.table/index.html

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
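
As a starting point for the kind of experiment suggested above, here is a rough sketch of a grouped summary with data.table, timed with system.time() and followed by a simple test on the aggregated table. The data are simulated, the column names and the 5-minute frame width are assumptions, and the t-test is only a placeholder for whatever hypothesis test is actually needed.

```r
# Sketch only: filter, aggregate into time frames, summarise, then test.
# Assumes the data.table package is installed. The data are simulated and
# the column names (ms, side, price, size) are hypothetical.
library(data.table)

set.seed(1)
n <- 1e5
quotes <- data.table(
  ms    = sort(sample.int(86400000L, n, replace = TRUE)),  # ms from midnight
  side  = sample(c("Bid", "Ask"), n, replace = TRUE),
  price = 15.48 + rnorm(n, sd = 0.01),
  size  = sample(c(100L, 200L, 300L, 400L), n, replace = TRUE)
)

# Keep bids only, then aggregate into 5-minute frames (300000 ms each).
frame_ms <- 300000L
timing <- system.time(
  stats <- quotes[side == "Bid",
                  .(n          = .N,
                    mean_price = mean(price),
                    sd_price   = sd(price),
                    total_size = sum(size)),
                  by = .(frame = ms %/% frame_ms)]
)
print(timing)
print(head(stats))

# Placeholder test on the small aggregated table: compare mean bid prices
# in the first and second half of the day (frame 144 starts at noon).
morning   <- stats[frame <  144L, mean_price]
afternoon <- stats[frame >= 144L, mean_price]
print(t.test(morning, afternoon))
```

Running the same filter and grouping as a database query and comparing the elapsed times on one real day of data would give a more reliable answer than any general rule.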
Well, that answered some of my questions, though you forgot to send your answer to the r-help list rather than just to me. I don't use Windows, so someone else may have better advice.

On Wed, May 25, 2011 at 12:02 PM, <gbrenes at ssc.wisc.edu> wrote:
> Sorry, I forgot to be more specific.
>
> I am using Windows XP.
>
> I am using R.12.2
>
> I installed both packages from the install packages menu.

And were there any messages?

> I always write library(name.of.library), and it is enough.
>
> But when I write library(nlme), R does not find nlme right away
>
> I load nlme first and it says package was downloaded successfully.

Load? Installed? "Downloaded successfully" is not the same as "installed successfully". How about the actual wording?

> However, when I try to do this again in another day, R cannot find nlme,
> so I try to load mgcv with library(mgcv), then I get this message:
>
> Error: package 'nlme' could not be loaded
> In addition: Warning message:
> In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc) :
>   there is no package called 'nlme'
>
> Is there any problem with nlme that I need to install it every time I open R?

I wouldn't think so. But obviously something is not right, and you still haven't provided enough information to be able to diagnose the problem.

Sarah

> Gilbert
>
>> We really need some more information to be able to help you (as
>> requested in the posting guide):
>>
>> What OS?
>> What version of R?
>>
>> How did you install nlme? Were there any messages?
>>
>> What happens when you type library(nlme) at the R prompt?
>>
>> How did you install mgcv? Were there any messages?
>>
>> On Wed, May 25, 2011 at 11:13 AM, <gbrenes at ssc.wisc.edu> wrote:
>>> Hi.
>>>
>>> I have been trying to load the mgcv package but I always get the error
>>> message:
>>>
>>>   there is no package called 'nlme'
>>> Error: package/namespace load failed for 'mgcv'
>>>
>>> I load the package nlme and still I get the same message. I have noticed
>>> that there are some problems in using nlme in recent versions of R. Is
>>> there any suggestion or any special issue that I should know about nlme
>>> or mgcv?
>>>
>>> Thanks
>>>
>>> Gilbert

-- 
Sarah Goslee
http://www.functionaldiversity.org
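
For reference, most of the details requested in this thread (R version, OS, library locations, and whether a package is really installed) can be collected from the R prompt. The sketch below is one way to do it; pkg_status() is a hypothetical helper written for this example, not a function that ships with R.

```r
# Sketch only: gather the information asked for above.
# pkg_status() is a hypothetical helper, not part of R itself.
pkg_status <- function(pkgs) {
  # system.file(package = p) returns "" when the package is not installed
  # in any library on .libPaths().
  paths <- vapply(pkgs, function(p) system.file(package = p),
                  character(1), USE.NAMES = FALSE)
  data.frame(package   = pkgs,
             installed = nzchar(paths),
             path      = paths,
             stringsAsFactors = FALSE)
}

print(R.version.string)           # which version of R
print(Sys.info()[["sysname"]])    # which operating system
print(.libPaths())                # where R looks for installed packages
print(pkg_status(c("nlme", "mgcv")))
```

If `installed` comes back FALSE for nlme in a fresh session, the earlier install did not persist, and the exact messages printed by install.packages("nlme") would be the next thing worth posting to the list.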