I am attempting to perform some simple data manipulation on a large data set. I have a snippet of the whole data set, and even my small snippet is 2GB as CSV.

Is there a way I can read my CSV, select a few columns, and write them to an output file as I go? This is what I do right now with a small test file:

data <- read.csv('data.csv', header = FALSE)

data_filter <- data[c(1, 3, 4)]

write.table(data_filter, file = "filter_data.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)

This writes the three columns to my desired output file. Can I do this while bypassing the storage of the entire array in memory?

Thank you very much for the help.
--
Jason
Establish a "connection" with the file you want to read and read in 1,000 rows at a time (or whatever number you want). If you are using read.csv and there is a header, you might want to skip it after the first pass, since there will be no header when you read the next 1,000 rows. Also pass 'as.is = TRUE' so that character fields are not converted to factors. You can then write out the columns that you want, and put the whole thing in a loop until you reach the end of the file.

On Mon, Aug 25, 2008 at 3:34 PM, Jason Thibodeau <jbloudg20 at gmail.com> wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
>
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time?
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
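The loop described above might be sketched like this (a minimal sketch, assuming no header row, comma-separated input, and that "filter_data.csv" does not already exist, since append = TRUE would otherwise add to any existing file):

```r
# Read 1,000 rows at a time from an open connection and append
# the selected columns to the output file.
con <- file("data.csv", open = "r")
chunk_size <- 1000
repeat {
  # read.csv() errors at end of file; treat that as the stop signal
  rows <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, as.is = TRUE),
    error = function(e) NULL
  )
  if (is.null(rows)) break
  write.table(rows[c(1, 3, 4)], file = "filter_data.csv", sep = ",",
              row.names = FALSE, col.names = FALSE, append = TRUE)
  if (nrow(rows) < chunk_size) break  # short chunk means we hit the end
}
close(con)
```

Because the connection stays open, each read.csv call picks up where the previous one left off, so only chunk_size rows are ever in memory at once.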
Hi,

Jason Thibodeau wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
>
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time?
> [...]

In this case, I think R is not the best tool for the job. I would rather suggest using an implementation of the awk language (e.g. gawk). I just tried the following on WinXP (a zipped file, 87MB zipped / 1.2GB unzipped, piped into gawk):

unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt

and it took about 90 seconds. Please note that you might need to specify your delimiter (field separator (FS) and output field separator (OFS)). Set them in a BEGIN block so that the first record is split correctly too:

gawk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }' data.csv > filter_data.csv

I hope this helps (despite not encouraging the usage of R),
Roland
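For anyone who wants to check the awk filter end to end before running it on the real 2GB file, here is a throwaway demo (sample.csv and out.csv are just names for this test; plain awk behaves the same as gawk here):

```shell
# Build a two-line sample, keep columns 1, 3, 4, and show the result.
printf 'a,b,c,d\n1,2,3,4\n' > sample.csv
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }' sample.csv > out.csv
cat out.csv   # prints: a,c,d then 1,3,4
```

The same one-liner scales to the full file unchanged, since awk streams line by line and never holds the whole file in memory.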