nabble.30.miller_2555 at spamgourmet.com
2010-Jan-19 14:25 UTC
[R] Memory usage in read.csv()
I'm sure this has gotten some attention before, but I have two CSV files generated from vmstat and free that are roughly 6-8 MB (about 80,000 lines) each. When I try to use read.csv(), R allocates all available memory (about 4.9 GB) when loading the files, which is over 300 times the size of the raw data. Here are the scripts used to generate the CSV files as well as the R code:

Scripts (run for roughly a 24-hour period):

    vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS=" "; OFS=","; print strftime("%F %T %Z"),$6,$7,$12,$13,$14,$15,$16,$17;}' >> ~/vmstat_20100118_133845.o;
    free -ms 1 | awk '$0 ~ /Mem\:/ {FS=" "; OFS=","; print strftime("%F %T %Z"),$2,$3,$4,$5,$6,$7}' >> ~/memfree_20100118_140845.o;

R code:

    infile.vms <- "~/vmstat_20100118_133845.o"
    infile.mem <- "~/memfree_20100118_140845.o"
    vms.colnames <- c("time","r","b","swpd","free","inact","active","si","so","bi","bo","in","cs","us","sy","id","wa","st")
    vms.colclass <- c("character",rep("integer",length(vms.colnames)-1))
    mem.colnames <- c("time","total","used","free","shared","buffers","cached")
    mem.colclass <- c("character",rep("integer",length(mem.colnames)-1))
    vmsdf <- read.csv(infile.vms, header=FALSE, colClasses=vms.colclass, col.names=vms.colnames)
    memdf <- read.csv(infile.mem, header=FALSE, colClasses=mem.colclass, col.names=mem.colnames)

I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux version 2.6.27.41-170.2.117.fc10.x86_64) with 6 GB of memory. There are no other significant programs running, and `rm()` followed by `gc()` successfully frees the memory (followed by swap-ins once other programs try to use previously cached information that was swapped to disk). I've incorporated the memory-saving suggestions in the `read.csv()` manual page, excluding the limit on the lines read (which shouldn't really be necessary here, since we're only talking about < 20 MB of raw data). Any suggestions, or is the read.csv() code known to have memory leak/overcommit issues?

Thanks
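One way to cap peak memory is to read the file in fixed-size chunks from an open connection and bind the pieces afterwards. A minimal sketch, assuming the variables defined above and an arbitrary 10,000-row chunk size:

    ## Sketch: read the vmstat log in 10,000-row chunks so only one chunk plus the
    ## accumulated pieces are held while parsing; read.csv() continues from an open
    ## connection on each call and raises an error once the input is exhausted.
    con <- file(infile.vms, open = "r")
    pieces <- list()
    repeat {
        block <- tryCatch(
            read.csv(con, header = FALSE, nrows = 10000,
                     colClasses = vms.colclass, col.names = vms.colnames),
            error = function(e) NULL)   # end of input
        if (is.null(block) || nrow(block) == 0) break
        pieces[[length(pieces) + 1]] <- block
    }
    close(con)
    vmsdf <- do.call(rbind, pieces)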
I read vmstat data in just fine without any problems. Here is an example of how I do it:

    VMstat <- read.table('vmstat.txt', header=TRUE, as.is=TRUE)

vmstat.txt looks like this:

    date     time     r b w swap     free     re  mf   pi po fr de sr intr syscalls cs   user sys id
    07/27/05 00:13:06 0 0 0 27755440 13051648 20  86   0  0  0  0  0  456  2918     1323 0    1   99
    07/27/05 00:13:36 0 0 0 27755280 13051480 11  53   0  0  0  0  0  399  1722     1411 0    1   99
    07/27/05 00:14:06 0 0 0 27753952 13051248 18  88   0  0  0  0  0  424  1259     1254 0    1   99
    07/27/05 00:14:36 0 0 0 27755304 13051496 17  85   0  0  0  0  0  430  1029     1246 0    1   99
    07/27/05 00:15:06 0 0 0 27755064 13051232 41  278  0  1  1  0  0  452  2047     1386 0    1   99
    07/27/05 00:15:36 0 0 0 27753824 13040720 125 1039 0  0  0  0  0  664  4097     1901 3    2   95
    07/27/05 00:16:06 0 0 0 27754472 13027000 15  91   0  0  0  0  0  432  1160     1273 0    1   99
    07/27/05 00:16:36 0 0 0 27754568 13027104 17  85   0  0  0  0  0  416  1058     1271 0    1   99

Have you tried a smaller portion of data? Here is what it took to read in a file with 85K lines:

    > system.time(vmstat <- read.table('c:/vmstat.txt', header=TRUE))
       user  system elapsed
       2.01    0.01    2.03
    > str(vmstat)
    'data.frame':   85680 obs. of  20 variables:
     $ date    : Factor w/ 2 levels "07/27/05","07/28/05": 1 1 1 1 1 1 1 1 1 1 ...
     $ time    : Factor w/ 2856 levels "00:00:26","00:00:56",..: 27 29 31 33 35 37 39 41 43 45 ...
     $ r       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ b       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ w       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ swap    : int  27755440 27755280 27753952 27755304 27755064 27753824 27754472 27754568 27754560 27754704 ...
     $ free    : int  13051648 13051480 13051248 13051496 13051232 13040720 13027000 13027104 13027096 13027240 ...
     $ re      : int  20 11 18 17 41 125 15 17 13 12 ...
     $ mf      : int  86 53 88 85 278 1039 91 85 69 51 ...
     $ pi      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ po      : int  0 0 0 0 1 0 0 0 0 1 ...
     $ fr      : int  0 0 0 0 1 0 0 0 0 1 ...
     $ de      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ sr      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ intr    : int  456 399 424 430 452 664 432 416 425 432 ...
     $ syscalls: int  2918 1722 1259 1029 2047 4097 1160 1058 1198 1727 ...
     $ cs      : int  1323 1411 1254 1246 1386 1901 1273 1271 1268 1477 ...
     $ user    : int  0 0 0 0 0 3 0 0 0 0 ...
     $ sys     : int  1 1 1 1 1 2 1 1 1 1 ...
     $ id      : int  99 99 99 99 99 95 99 99 99 99 ...

On Tue, Jan 19, 2010 at 9:25 AM, <nabble.30.miller_2555 at spamgourmet.com> wrote:
> [original message quoted in full; see above]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
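The same memory-saving arguments discussed in ?read.table (colClasses, nrows, comment.char) can also be passed to a call like Jim's. A sketch only, assuming his vmstat.txt really has two character columns followed by 18 integer columns:

    ## Sketch: explicit column classes for the 20-column vmstat.txt shown above;
    ## nrows is a mild over-estimate of the 85,680 rows, which lets R pre-allocate.
    cls <- c(rep("character", 2),   # date, time
             rep("integer", 18))    # remaining vmstat counters
    vmstat <- read.table("vmstat.txt", header = TRUE, colClasses = cls,
                         nrows = 90000, comment.char = "")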
You could also try read.csv.sql in sqldf. See examples on the sqldf home page:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql

On Tue, Jan 19, 2010 at 9:25 AM, <nabble.30.miller_2555 at spamgourmet.com> wrote:
> [original message quoted in full; see above]
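A sketch of what that might look like for the vmstat file from the original post (the argument values are assumptions; read.csv.sql stages the file through a temporary SQLite database rather than parsing it with read.csv itself):

    library(sqldf)
    ## Sketch: pull the comma-separated, header-less vmstat log in via SQLite,
    ## then attach the column names used in the original post.
    vmsdf <- read.csv.sql("~/vmstat_20100118_133845.o",
                          sql = "select * from file",
                          header = FALSE, sep = ",")
    names(vmsdf) <- c("time","r","b","swpd","free","inact","active","si","so",
                      "bi","bo","in","cs","us","sy","id","wa","st")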