I am attempting to perform some simple data manipulation on a large data set. I have a snippet of the whole data set, and even my small snippet is 2GB as CSV.

Is there a way I can read my CSV, select a few columns, and write them to an output file as I go? This is what I do right now with a small test file:

data <- read.csv('data.csv', header = FALSE)

data_filter <- data[c(1, 3, 4)]

write.table(data_filter, file = "filter_data.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)

This writes the three columns to my desired output file. Can I do this while bypassing the storage of the entire array in memory?

Thank you very much for the help.
--
Jason
Establish a "connection" with the file you want to read and read in 1,000 rows at a time (or whatever number you want). If you are using read.csv and there is a header, you might want to skip it after the first pass, since there will be no header when you read the next 1,000 rows. Also pass 'as.is = TRUE' so that character fields are not converted to factors. You can then write out the columns that you want, and put the whole thing in a loop until you reach the end of the file.

On Mon, Aug 25, 2008 at 3:34 PM, Jason Thibodeau <jbloudg20 at gmail.com> wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
>
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time?
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
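The loop described above might be sketched like this (a minimal sketch, assuming no header row, comma-separated input, and that "filter_data.csv" does not already exist, since append = TRUE would otherwise add to any existing file):

```r
# Read 1,000 rows at a time from an open connection and append
# the selected columns to the output file.
con <- file("data.csv", open = "r")
chunk_size <- 1000
repeat {
  # read.csv() errors at end of file; treat that as the stop signal
  rows <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, as.is = TRUE),
    error = function(e) NULL
  )
  if (is.null(rows)) break
  write.table(rows[c(1, 3, 4)], file = "filter_data.csv", sep = ",",
              row.names = FALSE, col.names = FALSE, append = TRUE)
  if (nrow(rows) < chunk_size) break  # short chunk means we hit the end
}
close(con)
```

Because the connection stays open, each read.csv call picks up where the previous one left off, so only chunk_size rows are ever in memory at once.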
Hi,

Jason Thibodeau wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
>
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time?
> [...]

In this case, I think R is not the best tool for the job. I would rather suggest using an implementation of the awk language (e.g. gawk). I just tried the following on WinXP (a zipped file, 87MB zipped / 1.2GB unzipped, piped into gawk):

unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt

and it took about 90 seconds. Please note that you might need to specify your delimiter (field separator (FS) and output field separator (OFS)). Set them in a BEGIN block so that the first record is split correctly too:

gawk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }' data.csv > filter_data.csv

I hope this helps (despite not encouraging the usage of R),
Roland
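For anyone who wants to check the awk filter end to end before running it on the real 2GB file, here is a throwaway demo (sample.csv and out.csv are just names for this test; plain awk behaves the same as gawk here):

```shell
# Build a two-line sample, keep columns 1, 3, 4, and show the result.
printf 'a,b,c,d\n1,2,3,4\n' > sample.csv
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }' sample.csv > out.csv
cat out.csv   # prints: a,c,d then 1,3,4
```

The same one-liner scales to the full file unchanged, since awk streams line by line and never holds the whole file in memory.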