thr3ads.net - R help - [R] Processing large datasets [May 2011]

Roman Naumenko

2011-May-25 04:29 UTC

[R] Processing large datasets

Hi R list,

I'm new to R software, so I'd like to ask about its capabilities.
What I'm looking to do is to run some statistical tests on quite big 
tables which are aggregated quotes from a market feed.

This is a typical set of data.
Each day contains millions of records (up to 10 million unfiltered).

2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0

I'll need to filter it first based on some criteria.
Since I keep it in a MySQL database, this can be done with a query. That is
not very efficient, though; I have already checked.

Then I need to aggregate the dataset into different time frames (time is
represented in ms from midnight, like 35482391).
Again, this can be done with a database query, but I'm not sure which will
be faster.
The aggregated tables are going to be much smaller, on the order of
thousands of rows per observation day.

Then I need to calculate basic statistics: mean, standard deviation, sums,
etc. Once those are calculated, I need to perform some statistical
hypothesis tests.
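
The aggregate, summarise, and test steps described above could be sketched in base R roughly as follows. This is only an illustration under assumptions: the data are simulated, the column names (side, size, price, ms) and the 5-minute bucket width are hypothetical, and the t-test stands in for whichever hypothesis test is actually needed.

```r
# Simulated stand-in for one day's filtered quotes. Column names are
# hypothetical and the values are random, not real market data.
set.seed(1)
n <- 10000
quotes <- data.frame(
  side  = sample(c("Bid", "Ask"), n, replace = TRUE),
  size  = sample(c(100, 200, 300, 400), n, replace = TRUE),
  price = round(15.48 + rnorm(n, sd = 0.01), 4),
  # Time in ms from midnight, drawn between 09:30:00.000 and 15:59:59.999.
  ms    = sort(34200000 + sample.int(23400000, n) - 1)
)

# Assign each quote to a 5-minute time frame (300,000 ms per bucket).
bucket_ms <- 5 * 60 * 1000
quotes$bucket <- quotes$ms %/% bucket_ms

# Basic statistics per time frame: one row per bucket.
by_bucket <- split(quotes$price, quotes$bucket)
stats <- data.frame(
  bucket     = as.numeric(names(by_bucket)),
  n          = sapply(by_bucket, length),
  mean_price = sapply(by_bucket, mean),
  sd_price   = sapply(by_bucket, sd),
  total_size = as.vector(tapply(quotes$size, quotes$bucket, sum))
)

# Example hypothesis test: do bid and ask quotes differ in mean price?
tt <- t.test(price ~ side, data = quotes)
tt$p.value
```

The resulting stats table has one row per time frame, which matches the expectation that the aggregated data is far smaller than the raw feed.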

So, my question is: which tool is faster for aggregating and filtering
big datasets, MySQL or R?

Thanks,
--Roman N.


Jonathan Daily

2011-May-25 12:12 UTC

[R] Processing large datasets

In cases where I have to parse through large datasets that will not
fit into R's memory, I will grab relevant data using SQL and then
analyze said data using R. There are several packages designed to do
this, like [1] and [2] below, that allow you to query a database using
SQL and end up with that data in an R data.frame.

[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
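
A minimal sketch of that workflow, assuming the DBI and RSQLite packages are installed. An in-memory SQLite database stands in for the MySQL server, and the table and column names are made up for illustration. With RMySQL the connection would instead be opened with dbConnect(RMySQL::MySQL(), ...) and real credentials.

```r
library(DBI)

# In-memory SQLite database as a stand-in for the real MySQL server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table and column names, loaded with made-up rows.
dbWriteTable(con, "quotes", data.frame(
  symbol = c("DELL", "DELL", "DELL", "MSFT"),
  side   = c("Bid", "Bid", "Ask", "Bid"),
  size   = c(400, 300, 135, 200),
  price  = c(15.48, 15.48, 15.49, 24.10),
  stringsAsFactors = FALSE
))

# Let the database do the filtering; only matching rows reach R,
# and they arrive as an ordinary data.frame.
bids <- dbGetQuery(con,
  "SELECT symbol, size, price FROM quotes
   WHERE symbol = 'DELL' AND side = 'Bid'")

dbDisconnect(con)

# From here on it is plain R.
mean(bids$size)
```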



-- 
==============================================
Jon Daily
Technician
==============================================
#!/usr/bin/env outside
# It's great, trust me.

Steve Lianoglou

2011-May-25 14:00 UTC

[R] Processing large datasets

Hi,

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
> Hi R list,
>
> So, my question is: what tool faster for data aggregation and filtration
> on big datasets: mysql or R?

Why not try a few experiments and see for yourself -- I guess the
answer will depend on what exactly you are doing.
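
One such experiment could look like the sketch below, which uses system.time() to compare two base R ways of computing a per-minute mean. The data are made up, and the column names and row count are assumptions for illustration; timings will vary by machine.

```r
# Made-up data; half a million rows keeps the experiment quick.
set.seed(42)
n <- 5e5
x <- data.frame(
  ms    = sample.int(86400000, n, replace = TRUE) - 1,  # ms from midnight
  price = runif(n, 15, 16)
)
x$minute <- x$ms %/% 60000

# Time two ways of computing the mean price per minute.
t_tapply <- system.time(m1 <- tapply(x$price, x$minute, mean))
t_agg    <- system.time(m2 <- aggregate(price ~ minute, data = x, FUN = mean))

# Elapsed wall-clock seconds for each approach.
t_tapply["elapsed"]
t_agg["elapsed"]
```

The database side can be timed the same way by wrapping the query call, for example a DBI dbGetQuery() call, in system.time().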

If your datasets are *really* huge, check out some packages listed
under the "Large memory and out-of-memory data" section of the
"HighPerformanceComputing" task view at CRAN:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Also, if you find yourself needing to do lots of
"grouping/summarizing" type of calculations over large data frame-like
objects, you might want to check out the data.table package:

http://cran.r-project.org/web/packages/data.table/index.html
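
As a rough sketch of what that grouping and summarizing looks like with data.table (assuming the package is installed; the data are simulated and the column names are hypothetical):

```r
library(data.table)

# Simulated quotes; values are random and for illustration only.
set.seed(7)
n <- 100000
quotes <- data.table(
  symbol = sample(c("DELL", "MSFT", "INTC"), n, replace = TRUE),
  ms     = sample.int(86400000, n, replace = TRUE) - 1,  # ms from midnight
  size   = sample(c(100, 200, 300, 400), n, replace = TRUE),
  price  = runif(n, 15, 16)
)

# Filter rows in the first argument, compute summaries in the second,
# and group by symbol and one-minute time frame in `by`.
summary_dt <- quotes[
  size >= 200,
  .(n = .N,
    mean_price = mean(price),
    sd_price   = sd(price),
    total_size = sum(size)),
  by = .(symbol, minute = ms %/% 60000)
]

head(summary_dt)
```

Filtering, grouping, and summarizing happen in a single call here, which is the kind of operation the original question asks about.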

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Sarah Goslee

2011-May-25 16:10 UTC

[R] the mgcv package can not be loaded

Well, that answered some of my questions, though you forgot to send
your answer to the r-help list rather than just to me. I don't use
windows, so someone else may have better advice.

On Wed, May 25, 2011 at 12:02 PM, <gbrenes at ssc.wisc.edu> wrote:
> Sorry, I forgot to be more specific.
>
> I am using Windows XP.
>
> I am using R.12.2
>
>
> I installed both packages from the install packages menu.

And were there any messages?

> I always write library(name.of.library), and it is enough.
>
> But when I write library(nlme), R does not find nlme right away
>
> I load nlme first and it says package was downloaded succesfully.

Load? Installed? Downloaded successfully is not the same as installed
successfully. How about the actual wording?

> However, when I try to do this again in another day, R cannot find nlme,
> so I try to load mgcv with library(mgcv), then I get this message:
>
> Error: package 'nlme' could not be loaded
> In addition: Warning message:
> In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc) :
>   there is no package called 'nlme'
>
>
>
> Is there any problem with nlme that I need to install it every time I open
> R?

I wouldn't think so. But obviously something is not right, and you
still haven't provided enough information to be able to diagnose the
problem.
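
One way to gather that kind of information, as a sketch using only base R functions: the lines below show which library directories the current session searches and whether nlme is found in any of them. If the directory a package was installed into is not among these paths in a later session, library() will not find it.

```r
# Library directories R searches for installed packages in this session.
.libPaths()

# Is the package installed in any of them? nlme is used here because it
# is the package named in the error message.
pkg <- "nlme"
installed <- pkg %in% rownames(installed.packages())
installed

# If it is installed, show which directory it was found in.
if (installed) find.package(pkg)

# If it is not, installing it once should normally be enough:
# install.packages("nlme")
```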

Sarah
>
> Gilbert
>
>
>
>> We really need some more information to be able to help you (as
>> requested in the posting guide):
>>
>> What OS?
>> What version of R?
>>
>> How did you install nlme? Were there any messages?
>>
>> What happens when you type library(nlme) at the R prompt?
>>
>> How did you install mgcv? Were there any messages?
>>
>>
>> On Wed, May 25, 2011 at 11:13 AM, <gbrenes at ssc.wisc.edu> wrote:
>>> Hi.
>>>
>>> I have been trying to load the mgcv package but I always get the error
>>> message:
>>>
>>>   there is no package called 'nlme'
>>> Error: package/namespace load failed for 'mgcv'
>>>
>>>
>>> I load the package nlme and still I get the same message. I have noticed
>>> that there are some problems in using nlme in recent versions of R. Is
>>> there any suggestion or any special issue that I should know about nlme
>>> or mgcv?
>>>
>>> Thanks
>>>
>>>
>>> Gilbert
>>
>>

-- 
Sarah Goslee
http://www.functionaldiversity.org
