Josh B
2010-Apr-16 22:12 UTC
[R] Scanning only specific columns into R from a VERY large file
Hi, I turn to you, the R Sages, once again for help. You've never let me down! (1) Please make the following toy files: x <- read.table(textConnection("var.1 var.2 var.3 var.1000 indv.1 1 5 9 7 indv.210000 2 9 3 8"), header = TRUE) y <- read.table(textConnection("var.3 var.1000"), header = TRUE) write.csv(x, file = "x.csv") write.csv(y, file = "y.csv") (2) Pretend you are starting with the files "x.csv" and "y.csv." They come from another source -- an online database. Pretend that these files are much, much, much larger. Specifically: (a) Pretend that "x.csv" contains 1000 columns by 210,000 rows. (b) "y.csv" contains just header titles. Pretend that there are 90 header titles in "y.csv" in total. These header titles are a subset of the header titles in "x.csv." (3) What I want to do is scan (or import, or whatever the appropriate word is) only a subset of the columns from "x.csv" into an R. Specifically, I only want to scan the columns of data from "x.csv" into R that are indicated in the file "y.csv." I still want to scan in all 210000 rows from "x.csv," but only for the aforementioned columns listed in "y.csv." Can you guys recommend a strategy for me? I think I need to use the scan command, based on the hugeness of "x.csv," but I don't know what exactly to do. Specific code that gets the job done would be the most useful. Thank you very much in advance! Josh [[alternative HTML version deleted]]
Sharpie
2010-Apr-16 22:21 UTC
[R] Scanning only specific columns into R from a VERY large file
Josh B-3 wrote:> > Hi, > > I turn to you, the R Sages, once again for help. You've never let me down! > > (1) Please make the following toy files: > > x <- read.table(textConnection("var.1 var.2 var.3 var.1000 > indv.1 1 5 9 7 > indv.210000 2 9 3 8"), header = TRUE) > > y <- read.table(textConnection("var.3 var.1000"), header = TRUE) > > write.csv(x, file = "x.csv") > write.csv(y, file = "y.csv") > > (2) Pretend you are starting with the files "x.csv" and "y.csv." They come > from another source -- an online database. Pretend that these files are > much, much, much larger. Specifically: > (a) Pretend that "x.csv" contains 1000 columns by 210,000 rows. > (b) "y.csv" contains just header titles. Pretend that there are 90 > header titles in "y.csv" in total. These header titles are a subset of the > header titles in "x.csv." > > (3) What I want to do is scan (or import, or whatever the appropriate word > is) only a subset of the columns from "x.csv" into an R. Specifically, I > only want to scan the columns of data from "x.csv" into R that are > indicated in the file "y.csv." I still want to scan in all 210000 rows > from "x.csv," but only for the aforementioned columns listed in "y.csv." > > Can you guys recommend a strategy for me? I think I need to use the scan > command, based on the hugeness of "x.csv," but I don't know what exactly > to do. Specific code that gets the job done would be the most useful. > > Thank you very much in advance! > Josh >read.csv.sql() from the sqldf package looks like it may do what you want- it allows you to filter what gets read in from a CSV file using SQL statements, something like: SELECT list,of,column,names FROM file Hope this helps! -Charlie ----- Charlie Sharpsteen Undergraduate-- Environmental Resources Engineering Humboldt State University -- View this message in context: http://n4.nabble.com/Scanning-only-specific-columns-into-R-from-a-VERY-large-file-tp2013646p2013653.html Sent from the R help mailing list archive at Nabble.com.
Rubén Roa
2010-Apr-19 06:39 UTC
[R] Scanning only specific columns into R from a VERY large file
-----Mensaje original----- De: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] En nombre de Josh B Enviado el: s?bado, 17 de abril de 2010 0:12 Para: R Help Asunto: [R] Scanning only specific columns into R from a VERY large file Hi, I turn to you, the R Sages, once again for help. You've never let me down! (1) Please make the following toy files: x <- read.table(textConnection("var.1 var.2 var.3 var.1000 indv.1 1 5 9 7 indv.210000 2 9 3 8"), header = TRUE) y <- read.table(textConnection("var.3 var.1000"), header = TRUE) write.csv(x, file = "x.csv") write.csv(y, file = "y.csv") (2) Pretend you are starting with the files "x.csv" and "y.csv." They come from another source -- an online database. Pretend that these files are much, much, much larger. Specifically: (a) Pretend that "x.csv" contains 1000 columns by 210,000 rows. (b) "y.csv" contains just header titles. Pretend that there are 90 header titles in "y.csv" in total. These header titles are a subset of the header titles in "x.csv." (3) What I want to do is scan (or import, or whatever the appropriate word is) only a subset of the columns from "x.csv" into an R. Specifically, I only want to scan the columns of data from "x.csv" into R that are indicated in the file "y.csv." I still want to scan in all 210000 rows from "x.csv," but only for the aforementioned columns listed in "y.csv." Can you guys recommend a strategy for me? I think I need to use the scan command, based on the hugeness of "x.csv," but I don't know what exactly to do. Specific code that gets the job done would be the most useful. Thank you very much in advance! Josh ----------- Try with something like do.call("cbind",scan(file="yourfile.csv",what=list(NULL,NULL,....,0,NULL,0,NULL,NULL,...,NULL),flush=TRUE)) you have to work out how to set up the list of parameter 'what' to read the headers of 'y'. In the above the only columns read are those indicated by a '0'. HTH Ruben ____________________________________________________________________________________ Dr. Rub?n Roa-Ureta AZTI - Tecnalia / Marine Research Unit Txatxarramendi Ugartea z/g 48395 Sukarrieta (Bizkaia) SPAIN