Tal Galili
2009-Mar-18 23:17 UTC
[R] Reading a file line by line - separating lines VS separating columns
Hello all,

I want to read a large data set into R. My current problem is getting the data into a form that R can access. read.table won't work because the file is over 1 GB (and I am using Windows XP), so my plan is to read the file chunk by chunk and move each chunk into bigmemory (I'll look into that when the time comes; maybe ff is better?).

I ran into a problem with separating lines versus separating columns. I found a solution, but it doesn't feel like a smart one. Any ideas or help on how to improve it would be welcome.

    # sample code:

    # create a simple file
    zz <- file("ex.data", "w")  # open an output file connection
    cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
    cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
    cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
    close(zz)  # close the connection so the lines are written to disk

    # read one string per line; here we can limit the number of rows
    # we want to use and start from a specific row using skip
    (temp.file <- scan("ex.data", what = "", sep = "\n"))
    # or:
    # (temp.file <- readLines("ex.data"))
    str(temp.file)  # we get a character vector

    new.df <- NULL
    # go through the vector and split each line into columns
    for (i in seq_along(temp.file)) {
      new.df <- rbind(new.df, unlist(strsplit(temp.file[i], "\t")))
    }
    new.df

    # or maybe
    apply(as.data.frame(temp.file), 1, function(b) unlist(strsplit(b, "\t")))
    # but this transposes the matrix

Thanks,
Tal
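One way to avoid growing a matrix with rbind inside a loop, as in the question above, is to split all lines in a single strsplit call and bind the pieces once. This is only a minimal sketch: the file name ex_split.data and the five-column rows are made up for illustration, and it assumes every line has the same number of fields. Note that strsplit does not return a trailing empty field, so lines that end in a tab yield one field fewer than the number of tabs suggests.

```r
# Hypothetical example file in the session's temp directory
path <- file.path(tempdir(), "ex_split.data")
writeLines(rep("1\t2\t3\t4\t5", 3), path)

# One string per line
lines <- readLines(path)

# Split every line at once; the result is a list with one character vector per line
fields <- strsplit(lines, "\t", fixed = TRUE)

# Bind all rows in a single call instead of calling rbind once per line
mat <- do.call(rbind, fields)

# Convert the character matrix to numeric (assumes all fields are numbers)
storage.mode(mat) <- "numeric"
mat
```

Binding once matters for large inputs because rbind inside a loop copies the whole matrix on every iteration.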
jim holtman
2009-Mar-19 01:28 UTC
[R] Reading a file line by line - separating lines VS separating columns
You can do something like this using connections: read in a set of lines at a time and save the results in bigmemory or, in this case, in a 'save' image:

    zz <- file("ex.data", "w")  # open an output file
    for (i in 1:10000) cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
    close(zz)

    # read in the data 876 lines at a time and write out an image
    zz <- file("ex.data", "r")
    fileNo <- 1
    repeat {
      gotError <- 1  # set to 2 if there is an error
      # catch the error raised when there is no more data
      tryCatch(input <- read.table(zz, nrows = 876, sep = "\t"),
               error = function(x) gotError <<- 2)
      if (gotError == 2) break
      # save the intermediate data
      save(input, file = sprintf("file%03d.RData", fileNo))
      fileNo <- fileNo + 1
    }
    close(zz)

--
Jim Holtman
Cincinnati, OH

What is the problem that you are trying to solve?
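A variation on the chunked read above, sketched under assumptions rather than taken from the thread: readLines on an open connection returns a zero-length vector at end of file, so the loop can stop without relying on an error from read.table. The file name ex_chunks.data, the 2000-row test file, and the counters n_rows and n_chunks are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical example file in the session's temp directory
path <- file.path(tempdir(), "ex_chunks.data")
writeLines(rep("1\t2\t3\t4\t5", 2000), path)

con <- file(path, "r")   # open once so each read continues where the last stopped
chunk_size <- 876
n_rows <- 0
n_chunks <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached

  # Parse the chunk; the text argument lets read.table work on a character vector
  chunk <- read.table(text = lines, sep = "\t")

  # The chunk could be saved or copied into bigmemory/ff at this point
  n_rows <- n_rows + nrow(chunk)
  n_chunks <- n_chunks + 1
}
close(con)

c(rows = n_rows, chunks = n_chunks)
```

Reading the raw lines first also leaves room to inspect or repair a chunk before it is parsed.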