nabble.30.miller_2555 at spamgourmet.com
2010-Jan-19 14:25 UTC
[R] Memory usage in read.csv()
I'm sure this has gotten some attention before, but I have two CSV files generated from vmstat and free that are roughly 6-8 MB (about 80,000 lines) each. When I try to use read.csv(), R allocates all available memory (about 4.9 GB) when loading the files, which is over 300 times the size of the raw data. Here are the scripts used to generate the CSV files as well as the R code:

Scripts (run for roughly a 24-hour period):

    vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS=" "; OFS=","; print strftime("%F %T %Z"),$6,$7,$12,$13,$14,$15,$16,$17;}' >> ~/vmstat_20100118_133845.o;
    free -ms 1 | awk '$0 ~ /Mem\:/ {FS=" "; OFS=","; print strftime("%F %T %Z"),$2,$3,$4,$5,$6,$7}' >> ~/memfree_20100118_140845.o;

R code:

    infile.vms <- "~/vmstat_20100118_133845.o"
    infile.mem <- "~/memfree_20100118_140845.o"
    vms.colnames <- c("time","r","b","swpd","free","inact","active","si","so","bi","bo","in","cs","us","sy","id","wa","st")
    vms.colclass <- c("character",rep("integer",length(vms.colnames)-1))
    mem.colnames <- c("time","total","used","free","shared","buffers","cached")
    mem.colclass <- c("character",rep("integer",length(mem.colnames)-1))
    vmsdf <- read.csv(infile.vms, header=FALSE, colClasses=vms.colclass, col.names=vms.colnames)
    memdf <- read.csv(infile.mem, header=FALSE, colClasses=mem.colclass, col.names=mem.colnames)

I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux version 2.6.27.41-170.2.117.fc10.x86_64) with 6 GB of memory. There are no other significant programs running, and `rm()` followed by `gc()` successfully frees the memory (followed by swap-ins once other programs try to use previously cached information that was swapped to disk). I've incorporated the memory-saving suggestions in the `read.csv()` manual page, excluding the limit on the lines read (which shouldn't really be necessary here, since we're only talking about < 20 MB of raw data). Any suggestions, or is the read.csv() code known to have memory leak/overcommit issues?

Thanks
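One way to cap peak memory is to read the file in fixed-size chunks from an open connection and bind the pieces afterwards. A minimal sketch, assuming the variables defined above and an arbitrary 10,000-row chunk size:

    ## Sketch: read the vmstat log in 10,000-row chunks so only one chunk plus the
    ## accumulated pieces are held while parsing; read.csv() continues from an open
    ## connection on each call and raises an error once the input is exhausted.
    con <- file(infile.vms, open = "r")
    pieces <- list()
    repeat {
        block <- tryCatch(
            read.csv(con, header = FALSE, nrows = 10000,
                     colClasses = vms.colclass, col.names = vms.colnames),
            error = function(e) NULL)   # end of input
        if (is.null(block) || nrow(block) == 0) break
        pieces[[length(pieces) + 1]] <- block
    }
    close(con)
    vmsdf <- do.call(rbind, pieces)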
I read vmstat data in just fine without any problems. Here is an example of how I do it:

    VMstat <- read.table('vmstat.txt', header=TRUE, as.is=TRUE)

vmstat.txt looks like this:

    date     time     r b w swap     free     re  mf   pi po fr de sr intr syscalls cs   user sys id
    07/27/05 00:13:06 0 0 0 27755440 13051648 20  86   0  0  0  0  0  456  2918     1323 0    1   99
    07/27/05 00:13:36 0 0 0 27755280 13051480 11  53   0  0  0  0  0  399  1722     1411 0    1   99
    07/27/05 00:14:06 0 0 0 27753952 13051248 18  88   0  0  0  0  0  424  1259     1254 0    1   99
    07/27/05 00:14:36 0 0 0 27755304 13051496 17  85   0  0  0  0  0  430  1029     1246 0    1   99
    07/27/05 00:15:06 0 0 0 27755064 13051232 41  278  0  1  1  0  0  452  2047     1386 0    1   99
    07/27/05 00:15:36 0 0 0 27753824 13040720 125 1039 0  0  0  0  0  664  4097     1901 3    2   95
    07/27/05 00:16:06 0 0 0 27754472 13027000 15  91   0  0  0  0  0  432  1160     1273 0    1   99
    07/27/05 00:16:36 0 0 0 27754568 13027104 17  85   0  0  0  0  0  416  1058     1271 0    1   99

Have you tried a smaller portion of data? Here is what it took to read in a file with 85K lines:

    > system.time(vmstat <- read.table('c:/vmstat.txt', header=TRUE))
       user  system elapsed
       2.01    0.01    2.03
    > str(vmstat)
    'data.frame':   85680 obs. of  20 variables:
     $ date    : Factor w/ 2 levels "07/27/05","07/28/05": 1 1 1 1 1 1 1 1 1 1 ...
     $ time    : Factor w/ 2856 levels "00:00:26","00:00:56",..: 27 29 31 33 35 37 39 41 43 45 ...
     $ r       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ b       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ w       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ swap    : int  27755440 27755280 27753952 27755304 27755064 27753824 27754472 27754568 27754560 27754704 ...
     $ free    : int  13051648 13051480 13051248 13051496 13051232 13040720 13027000 13027104 13027096 13027240 ...
     $ re      : int  20 11 18 17 41 125 15 17 13 12 ...
     $ mf      : int  86 53 88 85 278 1039 91 85 69 51 ...
     $ pi      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ po      : int  0 0 0 0 1 0 0 0 0 1 ...
     $ fr      : int  0 0 0 0 1 0 0 0 0 1 ...
     $ de      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ sr      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ intr    : int  456 399 424 430 452 664 432 416 425 432 ...
     $ syscalls: int  2918 1722 1259 1029 2047 4097 1160 1058 1198 1727 ...
     $ cs      : int  1323 1411 1254 1246 1386 1901 1273 1271 1268 1477 ...
     $ user    : int  0 0 0 0 0 3 0 0 0 0 ...
     $ sys     : int  1 1 1 1 1 2 1 1 1 1 ...
     $ id      : int  99 99 99 99 99 95 99 99 99 99 ...

On Tue, Jan 19, 2010 at 9:25 AM, <nabble.30.miller_2555 at spamgourmet.com> wrote:
> [original message quoted in full; see above]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
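The same memory-saving arguments discussed in ?read.table (colClasses, nrows, comment.char) can also be passed to a call like Jim's. A sketch only, assuming his vmstat.txt really has two character columns followed by 18 integer columns:

    ## Sketch: explicit column classes for the 20-column vmstat.txt shown above;
    ## nrows is a mild over-estimate of the 85,680 rows, which lets R pre-allocate.
    cls <- c(rep("character", 2),   # date, time
             rep("integer", 18))    # remaining vmstat counters
    vmstat <- read.table("vmstat.txt", header = TRUE, colClasses = cls,
                         nrows = 90000, comment.char = "")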
You could also try read.csv.sql in sqldf. See examples on the sqldf home page:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql

On Tue, Jan 19, 2010 at 9:25 AM, <nabble.30.miller_2555 at spamgourmet.com> wrote:
> [original message quoted in full; see above]
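A sketch of what that might look like for the vmstat file from the original post (the argument values are assumptions; read.csv.sql stages the file through a temporary SQLite database rather than parsing it with read.csv itself):

    library(sqldf)
    ## Sketch: pull the comma-separated, header-less vmstat log in via SQLite,
    ## then attach the column names used in the original post.
    vmsdf <- read.csv.sql("~/vmstat_20100118_133845.o",
                          sql = "select * from file",
                          header = FALSE, sep = ",")
    names(vmsdf) <- c("time","r","b","swpd","free","inact","active","si","so",
                      "bi","bo","in","cs","us","sy","id","wa","st")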