When you read your file into R, look at the structure of the object:
str(tab)
and also at its size:
object.size(tab)
That will tell you what your data look like and how much memory the
object actually takes in R. Also, in read.table use colClasses to
specify the format of each column up front; that can make the read
faster. You might also want to force a garbage collection with gc()
to see if that frees up any memory. If your input is about 2M lines
of three columns (alpha, numeric, numeric), I would guess an
object.size of about 50MB: roughly 1.8M rows times (4 bytes for a
factor code + 2 x 8 bytes for the doubles) is ~36MB, plus some
data.frame overhead. This information would help.
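For example (a sketch; the file name and column types are taken from
the message below, so adjust to your data):

tab <- read.table("~/20090708.tab",
                  colClasses = c("factor", "numeric", "numeric"))
str(tab)                               # column types and a preview
print(object.size(tab), units = "Mb")  # size of the object in memory
gc()                                   # collect garbage, report usage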
On Mon, Sep 14, 2009 at 11:11 PM, Evan Klitzke <evan at eklitzke.org> wrote:
> Hello all,
>
> To start with, these measurements are on Linux with R 2.9.2 (64-bit
> build) and Python 2.6 (also 64-bit).
>
> I've been investigating R for some log file analysis that I've been
> doing. I'm coming at this from the angle of a programmer who's
> primarily worked in Python. As I've been playing around with R, I've
> noticed that R seems to use a *lot* of memory, especially compared to
> Python. Here's an example of what I'm talking about. I have a sample
> data file whose characteristics are like this:
>
> [evan at t500 ~]$ ls -lh 20090708.tab
> -rw-rw-r-- 1 evan evan 63M 2009-07-08 20:56 20090708.tab
>
> [evan at t500 ~]$ head 20090708.tab
> spice 1247036405.04 0.0141088962555
> spice 1247036405.01 0.046797990799
> spice 1247036405.13 0.0137498378754
> spice 1247036404.87 0.0594480037689
> spice 1247036405.02 0.0170919895172
> topic 1247036404.74 0.512196063995
> user_details 1247036404.64 0.242133140564
> spice 1247036405.23 0.0408620834351
> biz_details 1247036405.04 0.40732884407
> spice 1247036405.35 0.0501029491425
>
> [evan at t500 ~]$ wc -l 20090708.tab
> 1797601 20090708.tab
>
> So it's basically a CSV file (actually, space delimited) where all of
> the lines are three columns, a low-cardinality string, a double, and a
> double. The file itself is 63M. Python can load all of the data from
> the file really compactly (source for the script at the bottom of the
> message):
>
> [evan at t500 ~]$ python code/scratch/pymem.py
> VIRT = 25230, RSS = 860
> VIRT = 81142, RSS = 55825
>
> So this shows that my Python process starts out at 860K RSS memory
> before doing any processing, and ends at 55M of RSS memory. This is
> pretty good; it's actually less than the size of the file, since a
> double can be stored more compactly than the textual data stored in
> the data file.
>
> Since I'm new to R I didn't know how to read /proc and so forth, so
> instead I launched an R repl and used ps to record the RSS memory
> usage before and after running the following statement:
>
>> tab <- read.table("~/20090708.tab")
>
> The numbers I measured were:
> VIRT = 176820, RSS = 26180  (just after starting the repl)
> VIRT = 414284, RSS = 263708 (after executing the command)
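You can read /proc from inside R the same way your Python script
does, by the way; a minimal sketch:

statm <- scan(sprintf("/proc/%d/statm", Sys.getpid()), n = 2, quiet = TRUE)
cat("VIRT =", statm[1], "RSS =", statm[2], "\n")  # in pages, as in statm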
>
> This kind of concerns me. I can understand why R uses more memory at
> startup, since it's launching a full repl which my Python script
> wasn't doing. But I would have expected the memory usage to not have
> grown more than Python's did after loading the data. In fact, R ought to
> be able to use less memory, since the first column is textual and has
> low cardinality (I think 7 distinct values), so storing it as a factor
> should be very memory efficient.
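Factors do store one 4-byte integer code per element plus the small
vector of levels, versus an 8-byte pointer per element for a character
vector on a 64-bit build; a quick sketch with made-up data of the same
shape as yours:

x <- rep(c("spice", "topic", "user_details"), length.out = 1.8e6)
object.size(x)           # character vector: ~8 bytes per element
object.size(factor(x))   # factor: ~4 bytes per element, plus 3 levels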
>
> For the things that I want to use R for, I know I'll be processing
> much larger datasets, and at the rate that R is consuming memory it
> may not be possible to fully load the data into memory. I'm concerned
> that it may not be worth pursuing learning R if it's possible to load
> the data into memory using something like Python but not R. I don't
> want to rule out the possibility that I'm overlooking something, since
> I'm new to the language. Can anyone answer for me:
> * What is R doing with all of that memory?
> * Is there something I did wrong? Is there a more memory-efficient
> way to load this data?
> * Are there R modules that can store large data-sets in a more
> memory-efficient way? Can anyone relate their experiences with them?
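On the memory-efficient loading question, one thing to try (a sketch;
the column names here are made up) is scan(), which reads directly
into typed vectors and skips some of read.table's overhead:

cols <- scan("~/20090708.tab",
             what = list(servlet = "", timestamp = 0, elapsed = 0))
cols$servlet <- factor(cols$servlet)  # compress the low-cardinality column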
>
> For reference, here's the Python script I used to measure Python's
> memory usage:
>
> import os
>
> def show_mem():
>     statm = open('/proc/%d/statm' % os.getpid()).read()
>     print 'VIRT = %s, RSS = %s' % tuple(statm.split(' ')[:2])
>
> def read_data(fname):
>     servlets = []
>     timestamps = []
>     elapsed = []
>
>     for line in open(fname, 'r'):
>         s, t, e = line.strip().split(' ')
>         servlets.append(s)
>         timestamps.append(float(t))
>         elapsed.append(float(e))
>
>     show_mem()
>
> if __name__ == '__main__':
>     show_mem()
>     read_data('/home/evan/20090708.tab')
>
>
> --
> Evan Klitzke <evan at eklitzke.org> :wq
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?