Robin Jeffries
2010-Jan-21 01:22 UTC
[R] Problems completely reading in a "large" sized data set
I have been through the help file archives a number of times, and still cannot figure out what is wrong.

I have a tab-delimited text file, 76 Mb, so while it's large... it's not -that- large. I'm running Win7 x64 with 4 GB RAM and R 2.10.1.

When I open this data in Excel, I have 27 columns and 450932 rows, excluding the first row containing variable names. I am trying to get this into R as a dataset for analysis.

zz <- "Data/media1y.txt"
f <- file(zz, 'r')             # open the file
rl <- readLines(f, 1)          # read the first line
colnames <- strsplit(rl, '\t')
p <- length(colnames[[1]])     # count the number of columns
nobs <- 450932
close(f)

Using:

d1 <- matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE, what = rep("character", p),
                  nlines = nobs),
             ncol = p, nrow = nobs, byrow = TRUE,
             dimnames = list(NULL, colnames[[1]]))

produces the warning:

Read 5761719 items
Warning message:
In matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE, what = rep("character",  :
  data length [5761719] is not a sub-multiple or multiple of the number of rows [10]

Now, 5761719 / 27 = 213397. If I change nobs <- 213397 it reads in the file with no errors and produces a matrix that I can work with from here. But the file obviously is not complete.

At first I thought it might be reading only the first x rows. So I sorted by the first variable alphabetically in Excel before saving it as a txt file and reading it into R. head(d1) shows the correct first 6 rows, but when I ask for tail(d1) the entry for the first variable in the last row is

[213397,] "WSAH"

The 213397th row in Excel starts with "MM1" and the actual last row starts with "YE". The "WSA" in question can be found on Excel row 397548. That confuses the heck out of me. There are no blank lines.

Since there are >1000 categories for that first variable, I'm not going to manually match all of the frequencies, but the first 10 were exact, "MM1" was correct, and the last few before "WSA" were also correct. "WSA" itself had 3001 observations in R, whereas Excel has 3093. That also makes it seem that R is stopping reading the table at some point.

It shouldn't be a memory issue... right?

> object.size(d1)
56328480 bytes
> memory.size(max=TRUE)
[1] 444.06
> memory.size(max=NA)
[1] 3583.88
> memory.size(max=FALSE)
[1] 251.09

As a side question, I'm reading it all in as characters for now because when I tried to define a vector of column types, e.g.

wht <- list(rep("character", 7), 0, "logical", 0, "character")

to use in scan(), it still read everything in as character. I'm also not sure about the ""'s; I had to put them in to get list() -- or c() -- to even accept that. Any ideas with this?

Thanks!

--
Robin Jeffries
Dr.P.H. Candidate
Department of Biostatistics
UCLA School of Public Health
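On the side question: scan() only honours per-column types when what is a list with one template element per field; a plain character vector such as rep("character", p) makes every field character, which is why the wht attempt had no effect. A minimal sketch follows, assuming purely for illustration that columns 1-7 are character, column 8 numeric, column 9 logical and the rest character; the real layout of media1y.txt may differ.

zz <- "Data/media1y.txt"

hdr <- strsplit(readLines(zz, n = 1), "\t")[[1]]
p   <- length(hdr)

## one template element per column; the mode of each element sets that column's type
col_types <- c(as.list(rep("", 7)),          # columns 1-7 as character
               list(numeric(), logical()),   # columns 8 and 9 (assumed numeric, logical)
               rep(list(""), p - 9))         # remaining columns as character

d1 <- scan(zz, skip = 1, sep = "\t", what = col_types, fill = TRUE, quiet = TRUE)
names(d1) <- hdr
d1 <- as.data.frame(d1, stringsAsFactors = FALSE)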
Charles C. Berry
2010-Jan-21 16:41 UTC
[R] Problems completely reading in a "large" sized data set
On Wed, 20 Jan 2010, Robin Jeffries wrote:

> I have a tab-delimited text file. 76Mb, so while it's large.. it's not
> -that- large. I'm running Win7 x64 w/4G RAM and R 2.10.1
> [...]
> Now, 5761719/27 = 213397.
> If I change nobs<-213397 it reads in the file with no errors. It produces a
> matrix that I can work with from here. But the file obviously is not
> complete.

What does

length( grep( '\n', d1 ) )

report? If non-zero, you have unmatched quotes in some lines.

HTH,

Chuck

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu              UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
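A minimal sketch of following up on that check; the quote-counting part is an editorial suggestion rather than something Chuck spelled out, and the file name is taken from the original post. If the grep count is non-zero, scan() has glued several file lines into one field, which is exactly what an unmatched quote does.

## non-zero => some fields contain embedded newlines
length(grep("\n", d1))

## locate raw lines with an odd number of quote characters; likely culprits
raw     <- readLines("Data/media1y.txt")
nquotes <- nchar(gsub("[^\"']", "", raw))   # keep only " and ' characters, count them
head(which(nquotes %% 2 == 1))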
jim holtman
2010-Jan-21 16:53 UTC
[R] Problems completely reading in a "large" sized data set
Just use what='' in your scan() since all your data appears to be character. Also use comment.char='' and quote='' if there is the possibility of a misplaced quote or a comment character. I have no problem reading in files of that size. Also use 'count.fields' to see what your file looks like.

On Wed, Jan 20, 2010 at 8:22 PM, Robin Jeffries <rjeffries at ucla.edu> wrote:

> I have a tab-delimited text file. 76Mb, so while it's large.. it's not
> -that- large. I'm running Win7 x64 w/4G RAM and R 2.10.1
> [...]
> Now, 5761719/27 = 213397.
> If I change nobs<-213397 it reads in the file with no errors. It produces a
> matrix that I can work with from here. But the file obviously is not
> complete.
> [...]
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
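A minimal sketch of the approach Jim describes, using the file name from the original post. Setting quote = "" and comment.char = "" stops a stray quote or # from swallowing the rest of a line, and count.fields() shows whether every line really has the same number of tab-separated fields.

zz <- "Data/media1y.txt"

## how many tab-separated fields does R see on each line, with quoting disabled?
nf <- count.fields(zz, sep = "\t", quote = "", comment.char = "")
table(nf)    # every line, header included, should report the same count

## what = "" reads every field as character; quoting and comments disabled as above
x  <- scan(zz, skip = 1, sep = "\t", what = "",
           quote = "", comment.char = "", quiet = TRUE)
d1 <- matrix(x, ncol = nf[1], byrow = TRUE)
colnames(d1) <- scan(zz, nlines = 1, sep = "\t", what = "", quiet = TRUE)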