1. Can I avoid having RODBC use so much memory (35 times the data size or more)
making a data.frame & then .rda file via. sqlQuery/save?
2. If not, is there some more appropriate way from w/in R to pull large data
sets (2-5GB) into .rda files from sql?
[R] reducing RODBC odbcQuery memory use?
From: WILLIE, JILL <JILWIL_at_SAFECO.com>
Date: Thu 25 Jan 2007 - 22:27:02 GMT
Basic Questions:
Can I avoid having RODBC use so much memory (35 times the data size or
more) making a data.frame & then .rda file via. sqlQuery/save?
If not, is there some more appropriate way from w/in R to pull large data
sets (2-5GB) into .rda files from sql?
I get an unexpectedly high ratio of virtual memory to memory use (10:1).
Can that be avoided?
Testing details (R transcript below): 1GB CPU, 1GB RAM windows machine.
testing bigger input table (AUTCombinedWA_BILossCost_1per), size is 20MB in
sql, 100000 rows, 2 numeric columns, 55 integer columns; consumes 350000kb of
memory & 800000kb of virtual memory to execute the sqlQuery command. Memory
not released after the step finishes or upon execution of odbcCloseAll(), or
gc().
tested small input table, size is 2MB in sql, 10000 rows, 2 numeric
columns, 55 integer columns; consumes 55000kb of memory & 515000kb (vm seems
oddly high to me) of virtual memory to execute the sqlQuery command.
concluded the high memory use is isolated to the odbcQuery step w/in the
sqlQuery function as opposed to sqlGetResults or ODBC itself.
Relevant R session transcript:
>library(RODBC)
>channel<-odbcConnect("psmrd") >df_OnePer <-data.frame(sqlQuery(channel, "select * from
AUTCombinedWA_BILossCost_1per"))
>save(df_OnePer, file = "df_OnePer.rda")
Additional testing details:
I exited R which released all memory cleanly, then started R again, loaded the
.rda saved in prior step as below. This confirmed relatively little of the
memory is consumed going from .rda to data frame; isolating to the RODBC step:
> load("df_OnePer.rda") > df <- data.frame(df_OnePer)
I closed R & opened MS Access & used the same DSN "psmrd",
& to import the AUTCombinedWA_BILossCost_1per into MS Access which required
about 30000kb of memory & 20000kb of virtual.
And finally, I have this excerpt from Prof Brian Ripley that seems potentially
relevant (if it's not just confusion because I called them
'byte-size' when really I should have said they're integers just
having values limited to 1-255). In any case I'm unable to see from the
RODBC help how to specify this: "...sqlQuery returns a data frame
directly. I think you need to tell RODBC to translate your 'byte-sized
factors' to numeric, as it will be going through character if these are a
type it does not know about."
Read all the RODBC help, read all the data import guide & searched help
archives...can't find an answer. Would appreciate advice, experience, or
direction.
Jill Willie
Open Seas
Safeco Insurance
jilwil@safeco.com
-----Original Message-----
From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk] Sent: Thursday, January
25, 2007 12:05 AM To: WILLIE, JILL
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Size of data vs. needed memory...rule of thumb?
On Wed, 24 Jan 2007, WILLIE, JILL wrote:
> I have been searching all day & most of last night, but can't find
any
> benchmarking or recommendations regarding R system requirements for
very
> large (2-5GB) data sets to help guide our hardware configuration. If
> anybody has experience with this they're willing to share or could
> anybody point me in a direction that might be productive to research,
it
> would be much appreciated. Specifically: will R simply use as much
> memory as the OS makes available to it, unlimited?
Under most OSes. Because Windows has no means to limit the amount made
available, R under Windows does have it own limiting mechanism (which you
hit in the examples below). R under Linux will allow you to run a 4GB process
on a machine with 2GB RAM, but you are likely to get around 0.7%
usage. (One of my colleagues did that on a server earlier this week, hence
the very specific answer.)
> Is there a multi-threading version R, packages?
Not to run computations in R. Some parts of R (e.g. GUIs) and some libraries
(e.g. some BLAS) are multithreaded. There are multiprocess packages, e.g. Rmpi,
rpvm, snow.
> Does the core R package support 64-bit
Yes, and has for many years.
> & should I expect to see any difference in how memory's handled
under
> that version?
yes, because the address space will not get seriously fragmented. See the
appropriate section in R-admin.html (referenced from INSTALL).
> Is 3 GB of memory to 1GB of data a reasonable ballpark?
I'd say it was a bit low, but it really depends on the analysis you are
doing, how 1GB of data is made up (many rows?, many cols?, etc) and so on.
Had you asked me to suggest a ratio I would have said 5.
> Our testing thus far has been on a windows 32-bit box w/1GB of RAM & 1
> CPU; it appears to indicate something like 3GB of RAM for every 1GB of
> sql table (ex-indexes, byte-sized factors). At this point, we're
> planning on setting up a dual core 64-bit Linux box w/16GB of RAM for
> starters, since we have summed-down sql tables of approx 2-5GB
> generally.
>
> Here's details, just for context, or in case I'm misinterpreting
the
> results, or in case there's some more memory-efficient way to get data
> in R's binary format than going w/the data.frame.
Well, sqlQuery returns a data frame directly. I think you need to tell RODBC
to translate your 'byte-sized factors' to numeric, as it will be going
through character if these are a type it does not know about.
> R session:
> > library(RODBC) > > channel<-odbcConnect("psmrd")
> > FivePer <-data.frame(sqlQuery(channel, "select * from
> AUTCombinedWA_BILossCost_5per"))
>
> Error: cannot allocate vector of size 2000 Kb
> In addition: Warning messages:
> 1: Reached total allocation of 1023Mb: see
> help(memory.size)
> 2: Reached total allocation of 1023Mb: see
> help(memory.size)
>
>
> ODBC connection:
> Microsoft SQL Server ODBC Driver Version 03.86.1830
>
> Data Source Name: psmrd
> Data Source Description:
> Server: psmrdcdw01\modeling
> Database: OpenSeas_Work1
> Language: (Default)
> Translate Character Data: Yes
> Log Long Running Queries: No
> Log Driver Statistics: No
> Use Integrated Security: Yes
> Use Regional Settings: No
> Prepared Statements Option: Drop temporary procedures on
> disconnect
> Use Failover Server: No
> Use ANSI Quoted Identifiers: Yes
> Use ANSI Null, Paddings and Warnings: Yes
> Data Encryption: No
>
> Please be patient, I'm a new R user (or at least I'm trying to
be...at
> this point I'm mostly a new R-help-reader); I'd appreciated being
> pointed in the right direction if this isn't the right help list to
send
> this question to...or if this question is poorly worded (I did read
the
> posting guide). >
> Jill Willie
> Open Seas > Safeco Insurance
> jilwil@safeco.com
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
---------------------------------
[[alternative HTML version deleted]]