Tal Galili
2009-Mar-18 23:17 UTC
[R] Reading a file line by line - separating lines VS separating columns
Hello all.
I wish to read a large data set into R. My current issue is getting the
data into a form that R can access. read.table won't work because the
data is over 1GB in size (and I am using Windows XP), so my plan is to
read the file chunk by chunk and move each chunk into bigmemory (I'll
play with that when the time comes; maybe ff is better?).
I ran into a problem with separating lines versus separating columns. I
found a solution, but it doesn't feel like a smart one, so any ideas or
help on how to improve it would be welcome.
# sample code:

# creating a simple file
zz <- file("ex.data", "w")  # open an output file connection
cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
close(zz)  # flush and close the file before reading it back

(temp.file = scan("ex.data", what = "", sep = "\n"))
# here we can limit the amount of rows we want to use and start from a
# specific row using skip
# or:
# (temp.file = readLines("ex.data"))
str(temp.file)  # we get a vector of character

new.df <- NULL
# we go through the vector to split the columns
for (i in 1:length(temp.file)) {
  new.df <- rbind(new.df, unlist(strsplit(temp.file[i], "\t")))
}
new.df

# or maybe
apply(as.data.frame(temp.file), 1, function(b) unlist(strsplit(b, "\t")))
# but this transposes the matrix
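
The row-by-row rbind loop above grows the matrix one line at a time, which gets slow for many lines. A minimal sketch of a one-step alternative, assuming every line has the same number of tab-separated fields; the names `lines`, `fields` and `mat` are illustrative and not part of the original post:

```r
# Illustrative data: three lines with the same layout as ex.data
lines <- rep("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", 3)

# strsplit() is vectorized: it returns a list with one character vector per line
fields <- strsplit(lines, "\t", fixed = TRUE)

# Bind all rows in a single call instead of growing new.df inside a loop;
# one row per input line, so no transpose is needed afterwards
mat <- do.call(rbind, fields)
dim(mat)
```

This only gives a clean matrix when all lines split into the same number of fields; ragged lines would need padding first.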
Thanks,
Tal
--
----------------------------------------------
My contact information:
Tal Galili
Phone number: 972-50-3373767
FaceBook: Tal Galili
My Blogs:
http://www.r-statistics.com/
http://www.talgalili.com
http://www.biostatistics.co.il
jim holtman
2009-Mar-19 01:28 UTC
[R] Reading a file line by line - separating lines VS separating columns
You can do something like this using connections: read in a set of
lines at a time and save the results in bigmemory, or in this case a
'save' image:
zz <- file("ex.data", "w")  # open an output file
for (i in 1:10000) {
  cat("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", file = zz, sep = "\n")
}
close(zz)

# read in the data 876 lines at a time and write out an image
zz <- file("ex.data", "r")
fileNo <- 1
repeat {
  gotError <- 1  # set to 2 if there is an error
  # catch the error raised when there is no more data
  tryCatch(input <- read.table(zz, nrows = 876, sep = '\t'),
           error = function(x) gotError <<- 2)
  if (gotError == 2) break
  # save the intermediate data
  save(input, file = sprintf("file%03d.RData", fileNo))
  fileNo <- fileNo + 1
}
close(zz)
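
The loop above detects the end of the file by catching the error read.table raises when no lines are left. A variation that may be easier to follow is to pull raw lines with readLines and stop when it returns nothing. This is only a sketch under stated assumptions: the temporary file, the 2000-line size, and the names `path`, `con`, `rowsSeen` and `nChunks` are made up for illustration.

```r
# Write a small illustrative file (2000 lines) to a temporary path
path <- tempfile(fileext = ".data")
writeLines(rep("1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t\t555\t\t", 2000), path)

con <- file(path, "r")
chunkSize <- 876   # same chunk size as in the loop above
rowsSeen <- 0
nChunks <- 0
repeat {
  lines <- readLines(con, n = chunkSize)
  if (length(lines) == 0) break   # end of file reached, no error to catch
  # parse the chunk of text into a data frame, one row per line
  chunk <- read.table(text = lines, sep = "\t", header = FALSE)
  # ... store `chunk` here, e.g. with save() as above ...
  rowsSeen <- rowsSeen + nrow(chunk)
  nChunks <- nChunks + 1
}
close(con)
unlink(path)
```

Because the connection stays open between calls, each readLines call continues from where the previous one stopped, so the last chunk is simply shorter than the others.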
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?