Dear R users,

I've just started using the ff package.

There is a csv file (~4 GB) with 7 columns and 6e+7 rows. I want to read only one column from the file, skipping the first 100 rows. Below I've provided the different outcomes, which should clarify my problem.

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
...

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ff_2.2-7  bit_1.1-8

##---------------------------------------------------------------------------------------
## I want to read the second column only:
x.class <- c('NULL', 'numeric','NULL','NULL','NULL', 'NULL', 'NULL')

## The following command works fine:
> read.csv.ffdf(file=csvfile, header=FALSE, skip=100,
+               colClasses=x.class, nrows=1e3)
ffdf (all open) dim=c(1000,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
V2           V2       double        double FALSE           FALSE
   PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
V2            FALSE                 1                1               1
   PhysicalIsOpen
V2           TRUE
ffdf data
          V2
1    -0.5412
2    -0.5842
3    -0.5920
4    -0.5451
5    -0.5099
6    -0.5021
7    -0.4943
8    -0.5490
:          :
993  -0.4865
994  -0.6584
995  -0.7482
996  -0.8732
997  -0.8303
998  -0.7248
999  -0.5490
1000 -0.4240

## When I extend nrows by 1, I get a warning about the number of columns:
> read.csv.ffdf(file=csvfile, header=FALSE, skip=100,
+               colClasses=x.class, nrows=1001)
ffdf (all open) dim=c(1001,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix
V2           V2       double        double FALSE           FALSE
   PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
V2            FALSE                 1                1               1
   PhysicalIsOpen
V2           TRUE
ffdf data
          V2
1    -0.5412
2    -0.5842
3    -0.5920
4    -0.5451
5    -0.5099
6    -0.5021
7    -0.4943
8    -0.5490
:          :
994  -0.6584
995  -0.7482
996  -0.8732
997  -0.8303
998  -0.7248
999  -0.5490
1000 -0.4240
1001 -0.3849
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  cols = 1 != length(data) = 7

## Going much beyond 1000 rows fails outright:
> read.csv.ffdf(file=csvfile, header=FALSE, skip=100,
+               colClasses=x.class, nrows=1e4)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names

The question is: why? The number of columns does not change in the file...

I would appreciate any help.

Best,
Robert

--
View this message in context: http://r.789695.n4.nabble.com/ff-package-reading-selected-columns-from-csv-tp4637794.html
Sent from the R help mailing list archive at Nabble.com.

Having had a quick look at the source code of read.table.ffdf, I suspect that using 'NULL' in the colClasses argument is not supported. Could you try read.table.ffdf with colClasses specified for all columns (thereby reading in all columns of the file)? If that works, you can be fairly sure that the number of columns in the file is indeed constant (sometimes a ' or an unquoted , can mess things up).

Jan

threshold <r.kozarski at gmail.com> wrote:
> I want to read the second column only:
> x.class <- c('NULL', 'numeric','NULL','NULL','NULL', 'NULL', 'NULL')
> [...]
> Question is why? The number of columns does not change in the file...

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
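
As a rough way to test the "is the number of columns really constant?" part of the suggestion above without involving ff at all, base R's count.fields() can report the number of fields on every line. The sketch below is only an illustration: it writes a tiny made-up file (the contents and the deliberately broken third line are assumptions, not the poster's data) and uses the same separator and quote character that read.csv uses by default.

```r
# Hypothetical small file standing in for the real 4 GB CSV.
csvfile <- tempfile(fileext = ".csv")
writeLines(c(
  "1,-0.54,a,b,c,d,e",
  "2,-0.58,a,b,c,d,e",
  "3,-0.59,a,b,c,d,e,EXTRA",   # deliberately malformed: 8 fields
  "4,-0.55,a,b,c,d,e"
), csvfile)

# Count fields per line, matching read.csv's defaults (sep = ",", quote = "\"").
# For the real file one would add skip = 100 to mirror the original call.
n.fields <- count.fields(csvfile, sep = ",", quote = "\"", comment.char = "")

# Tabulate the counts; a clean 7-column file yields a single entry.
field.table <- table(n.fields)
print(field.table)

# Line numbers whose field count differs from the expected 7.
bad.lines <- which(n.fields != 7)
print(bad.lines)
```

On the full file the result is one integer per line, so with 6e+7 rows it needs a few hundred MB of RAM; the lines flagged in bad.lines are the ones worth inspecting for stray quotes or commas.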

...plus I get the following message after reading the whole set (all 7 columns):

> read.csv.ffdf(file=csvfile, header=FALSE, skip=100, first.rows=1000,
+               next.rows=1e7, VERBOSE=TRUE)
read.table.ffdf 1..1000 (1000) csv-read=0.02sec ffdf-write=0.08sec
read.table.ffdf 1001..10001000 (10000000) csv-read=282.16sec ffdf-write=65.01sec
read.table.ffdf 10001001..20001000 (10000000) csv-read=240.3sec ffdf-write=63.84sec
read.table.ffdf 20001001..30001000 (10000000) csv-read=213.78sec ffdf-write=149.2sec
read.table.ffdf 30001001..40001000 (10000000) csv-read=217.36sec ffdf-write=379.8sec
read.table.ffdf 40001001..50001000 (10000000) csv-read=541.28sec
Error: cannot allocate vector of size 381.5 Mb
In addition: There were 14 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In match(levels(x), lev) :
  Reached total allocation of 7987Mb: see help(memory.size)
2: In match(levels(x), lev) :
  Reached total allocation of 7987Mb: see help(memory.size)

--
View this message in context: http://r.789695.n4.nabble.com/ff-package-reading-selected-columns-from-csv-tp4637794p4637900.html
Sent from the R help mailing list archive at Nabble.com.

You probably have a character column (which is converted to a factor) or a factor column with a large number of distinct values. In ff, all the levels of a factor are stored in memory, so such a column can exhaust RAM even though the data themselves live on disk.

Jan

threshold <r.kozarski at gmail.com> wrote:
> ..plus I get the following message after reading the whole set (all 7
> columns):
> [...]
> Error: cannot allocate vector of size 381.5 Mb
> [...]
> 1: In match(levels(x), lev) :
>   Reached total allocation of 7987Mb: see help(memory.size)
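
To find out which column might be the one with too many distinct values, one option is to scan the file in chunks with base R and count the unique values per column. This is only a sketch under assumptions: the three-column toy file, the chunk size of 10 and the variable names are made up for illustration, and nothing here uses ff itself.

```r
# Hypothetical toy file: column 2 is unique per row, column 3 has 5 values.
csvfile <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,id%05d,%s", 1:25, 1:25,
                   rep(c("a", "b", "c", "d", "e"), 5)), csvfile)

chunk.size <- 10
con <- file(csvfile, open = "r")   # keep the connection open between reads
seen <- list(character(0), character(0), character(0))

repeat {
  # Read the next chunk as plain character so that no factors are created.
  # read.csv raises an error once no lines are left; here any error is
  # treated as end of input, which is a simplification.
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk.size,
             colClasses = "character"),
    error = function(e) NULL
  )
  if (is.null(chunk)) break
  # Accumulate the distinct values seen so far in each column.
  for (j in seq_along(seen)) seen[[j]] <- union(seen[[j]], chunk[[j]])
  if (nrow(chunk) < chunk.size) break   # short chunk: end of file reached
}
close(con)

# Number of distinct values per column.
n.distinct <- vapply(seen, length, integer(1))
print(n.distinct)
```

The accumulated unique values are themselves held in memory, so on a column with tens of millions of distinct strings it would be wiser to stop once a column passes some threshold rather than collect everything.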