Earl F. Glynn
2010-Oct-06 07:15 UTC
[R] Help troubleshooting silent failure reading huge file with read.delim
I am trying to read a tab-delimited 1.25 GB file of 4,115,119 records, each
with 52 fields.

I am using R 2.11.0 on a 64-bit Windows 7 machine with 8 GB of memory.

I have tried the two following statements with the same results:

  d <- read.delim(filename, as.is=TRUE)

  d <- read.delim(filename, as.is=TRUE, nrows=4200000)

I have tried starting R with this parameter, but that changed nothing:

  --max-mem-size=6GB

Everything appeared to have worked fine until I studied frequency counts of
the fields and realized data were missing.

> dim(d)
[1] 3388444      52

R read 3,388,444 records and missed 726,754 records. There were no error
messages or exceptions. I plotted a chart using the data and later
discovered that not all the data were represented in the chart.

R didn't just read the first 3,388,444 records and quit.

Here's what I believe happened (based on frequency counts of the first field
in the data.frame from R, and independently from another source):

* R read the first 1,866,296 records and then skipped 419,340 records.
* Next, R read 1,325,552 records and skipped 307,414 records.
* R read the last 196,596 records without any problems.

Questions:

Is there some memory-related parameter that I should adjust that might
explain the observed details above?

Shouldn't read.delim catch this failure instead of being silent about
dropping data?

Thanks for any help with this.

Earl F Glynn
Overland Park, KS
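One way to check, independently of read.delim, how many records the file
actually contains and whether any line has an unexpected field count is
count.fields. This is only a rough sketch, not something suggested in the
thread; 'filename' is the same variable used above, and the expected line
count depends on whether the file has a header row:

  ## Count tab-separated fields per physical line, with quote processing
  ## turned off so an unbalanced quote cannot merge many lines into one
  ## apparent record.
  fields_per_line <- count.fields(filename, sep = "\t", quote = "")
  length(fields_per_line)   # expect 4,115,119 data lines (plus 1 for a header line)
  table(fields_per_line)    # every record should show 52 fields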
David Winsemius
2010-Oct-06 09:25 UTC
[R] Help troubleshooting silent failure reading huge file with read.delim
On Oct 6, 2010, at 3:15 AM, Earl F. Glynn wrote:

> [...]

David Winsemius, MD
West Hartford, CT
David Winsemius
2010-Oct-06 09:30 UTC
[R] Help troubleshooting silent failure reading huge file with read.delim
Apologies for the blank post. Too little caffeine at 5:30 AM.

On Oct 6, 2010, at 3:15 AM, Earl F. Glynn wrote:

> [...]
>
> Is there some memory-related parameter that I should adjust that might
> explain the observed details above?

Can't think of any.

> Shouldn't read.delim catch this failure instead of being silent about
> dropping data?

More likely you have mismatched quotes in your file, and some fields are
accumulating large amounts of text. You should do some tabulations on your
text fields with nchar-based functions.

--
David Winsemius, MD
West Hartford, CT
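A rough sketch of that kind of check (the column name 'text_field' is only
a placeholder for whichever character columns the file actually contains):

  ## Re-read with quote processing disabled, so an unbalanced " cannot
  ## swallow whole blocks of records into a single field.
  d2 <- read.delim(filename, as.is = TRUE, quote = "")
  dim(d2)                                  # compare against the earlier 3,388,444 rows

  ## Tabulate field widths on the original read; a few enormous values
  ## would point to fields that absorbed the "missing" records.
  summary(nchar(d$text_field))
  head(sort(nchar(d$text_field), decreasing = TRUE))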
jim holtman
2010-Oct-06 16:08 UTC
[R] Help troubleshooting silent failure reading huge file with read.delim
Besides the mismatched quotes, another problem I have had with a file is
that illegal characters (0x1A in this case) signaled an end of read while
the file was being read.

On Wed, Oct 6, 2010 at 3:15 AM, Earl F. Glynn <efglynn at gmail.com> wrote:

> [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
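If that is the suspicion, one way to look for stray Ctrl-Z (0x1A) bytes,
which Windows treats as an end-of-file marker in text mode, is to scan the
file in binary mode. This is only a sketch, and it loads the whole 1.25 GB
file into memory as a raw vector:

  ## Read the file as raw bytes so nothing is interpreted as text, then
  ## count occurrences of 0x1A (Ctrl-Z, the DOS/Windows EOF character).
  raw_bytes <- readBin(filename, what = "raw", n = file.info(filename)$size)
  sum(raw_bytes == as.raw(0x1A))            # how many embedded EOF bytes
  which(raw_bytes == as.raw(0x1A))[1:5]     # byte offsets of the first few, if any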