christian.kamenik at astra.admin.ch
2013-Aug-06 08:47 UTC
[R] read.table.ffdf and fixed width files
Dear all I am working on Windows 7 32-bit, and the ff- package is my daily life-saver to overcome the inherent memory limitations. Recently, I tried using read.table.ffdf to import data from a fixed-width ASCII file (file size: 1'440'865'015 Bytes) with 6'079'455 lines and 32 variables using the command read.table.ffdf(file=my.filename, FUN="read.fwf", width=my.format, asffdf_args=list(col_args=list(pattern = my.pattern)) The command generates a temporary file, which has 1'629'328'120 Bytes, plus 32 ff files following my.pattern. The latter 32 files, however, only take up 136'000 Bytes. And the resulting R object has a dimension of 1000 x 32. To me, it seems that read.table.ffdf aborts the data import after 1000 lines, instead of importing the entire file. I tried running read.table.ffdf with different parameter settings, I was browsing the help pages and the mailing lists, but I did not find any hint on why read.table.ffdf aborts the data import. (Does it really? - The file size of the temporary file suggests that all data were read.) Any help would be highly appreciated Best Regard Christian Kamenik Project Manager Federal Department of the Environment, Transport, Energy and Communications DETEC Federal Roads Office FEDRO Division Road Traffic Road Accident Statistics Mailing Address: 3003 Bern Location: Weltpoststrasse 5, 3015 Bern Tel +41 31 323 14 89 Fax +41 31 323 43 21 christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch> www.astra.admin.ch<http://www.astra.admin.ch/> [[alternative HTML version deleted]]
What probably is the problem is that read.table.ffdf uses the nrows argument to read the file in chunks. However, read.fwf doesn't use a nrow argument but a n argument. One (non tested) solution is to write a wrapper around read.fwf and pass this wrapper to read.table.ffwf. Something like: my.read.fwf <- function(file, nrow=-1, ...) { read.fwf(file=file, n=nrow, ...) } Perhaps you'll also need to wrap some additional arguments. read.fwf is terribly slow for large fixed width files. I would advise to use the LaF package in combination with the laf_to_ffwf function from the ffbase package. ... Although judging from your other question you already looked at that. HTH, Jan On 08/06/2013 10:47 AM, christian.kamenik at astra.admin.ch wrote:> Dear all > > I am working on Windows 7 32-bit, and the ff- package is my daily life-saver to overcome the inherent memory limitations. Recently, I tried using read.table.ffdf to import data from a fixed-width ASCII file (file size: 1'440'865'015 Bytes) with 6'079'455 lines and 32 variables using the command > read.table.ffdf(file=my.filename, FUN="read.fwf", width=my.format, asffdf_args=list(col_args=list(pattern = my.pattern)) > > The command generates a temporary file, which has 1'629'328'120 Bytes, plus 32 ff files following my.pattern. The latter 32 files, however, only take up 136'000 Bytes. And the resulting R object has a dimension of 1000 x 32. To me, it seems that read.table.ffdf aborts the data import after 1000 lines, instead of importing the entire file. > > I tried running read.table.ffdf with different parameter settings, I was browsing the help pages and the mailing lists, but I did not find any hint on why read.table.ffdf aborts the data import. (Does it really? - The file size of the temporary file suggests that all data were read.) > > Any help would be highly appreciated > > Best Regard > > Christian Kamenik > Project Manager > > Federal Department of the Environment, Transport, Energy and Communications DETEC > Federal Roads Office FEDRO > Division Road Traffic > Road Accident Statistics > > Mailing Address: 3003 Bern > Location: Weltpoststrasse 5, 3015 Bern > > Tel +41 31 323 14 89 > Fax +41 31 323 43 21 > > christian.kamenik at astra.admin.ch<mailto:christian.kamenik at astra.admin.ch> > www.astra.admin.ch<http://www.astra.admin.ch/> > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
christian.kamenik at astra.admin.ch
2013-Aug-12 13:40 UTC
[R] read.table.ffdf and fixed width files
Dear R Users This is a summary of the things I tried with read.table.ffdf and fixed-width files. I would like to thank Jan Wijffels and Jan van der Laan for their suggestions and the time they spent on my problem! My objective was to import a file with 6'079'455 lines and 32 variables using the tools provided by the ff package. The "fixed-width file" I got was supposed to have a total width of 238. But it turned out that the last column, which should have had a width of four, contained either no entry, or entries with one or two characters followed by \n\r. The corresponding spaces were dropped when the file was created. This could be shown by lines <- readLines("my_file.txt") range(nchar(lines)) which resulted in 235 237 instead of 238. So the file was not really fixed width... I tried importing the file with library(ff) library(stringr) my.data <- read.table.ffdf(file="my_file.txt", FUN="read.fwf", widths = my.widths, header=F, VERBOSE=TRUE, first.rows=100000, col.names = my.names, fileEncoding = "LATIN1", transFUN=function(x){ z <- sapply(x, function(y) { y <- str_trim(y) y[y==""] <- NA factor(y)}) as.data.frame(z) } ) This took 4168 seconds and resulted in an object that included only 100'000 lines instead of 6'079'455 lines (I still don't know why...). Another approach was to use laf_open_fwf from package LaF and then laf_to_ffdf from package ffbase, which is really a simple approach as long as the width is not shorter than the given width (i.e. 238). So the idea was to add the missing spaces by running con <- file("my_file.txt", "rt") out <- file("my_file_converted.txt", "wt") system.time( while (TRUE) { lines <- readLines(con, encoding='LATIN1', n=1E5) if (length(lines) == 0) break lines <- sprintf("%-238s", lines) writeLines(lines, out, useBytes=TRUE) } ) close(con) close(out) and then library(LaF) library(ffbase) my.data.laf <- laf_open_fwf("my_file_converted.txt ", column_types=my.types, column_widths = my.widths, column_names = my.names) my.data <- laf_to_ffdf(my.data.laf) This worked really well, except that the whole process took quite some time. Appending the spaces took 2436 seconds, and converting the file from laf to ffdf took another 2628 seconds. The third approach I tested was the fastest, but used the Unix/Linux program awk outside R (run on Cygwin installed on Windows 7 32-bit): First, I converted my original file into a tab-delimited text file using awk: awk -v FIELDWIDTHS='3 28 4 30 28 6 3 30 10 3 3 6 6 5 1 2 1 1 2 2 2 4 2 4 7 30 1 1 3 2 4 4' -v OFS='\t' '{ $1=$1 ""; print }' <my_file.txt> my_file_delimited.txt Then I used read.delim.ffdf provided by the ff package: library(ff) library(stringr) my.data <- read.delim.ffdf(file="my_file_delimited.txt", header=F, VERBOSE=TRUE, first.rows=100000, col.names = my.names, colClasses=my.classes, fileEncoding = "LATIN1", transFUN=function(x) { z <- sapply(x, function(y) { y <- str_trim(y) y[y==""] <- NA factor(y)}) as.data.frame(z) } ) Running awk took only 203 seconds! And the import of the delimited file was finished after 1141 seconds. What I like most about the variants of read.table.ffdf and also about laf_to_ffdf is the fransFUN argument! Have a look at it, it allows a lot of fine tuning. Best Regard Christian Kamenik Project Manager Federal Department of the Environment, Transport, Energy and Communications DETEC Federal Roads Office FEDRO Division Road Traffic Road Accident Statistics Mailing Address: 3003 Bern Location: Weltpoststrasse 5, 3015 Bern Tel +41 31 323 14 89 Fax +41 31 323 43 21 christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch> www.astra.admin.ch<http://www.astra.admin.ch/> Von: Jan Wijffels [mailto:jwijffels@bnosac.be] Gesendet: Donnerstag, 8. August 2013 11:46 An: Kamenik Christian ASTRA Betreff: Re: read.table.ffdf and fixed width files Christian, You probably misspecified column names in the transFUN. Mark that read.table.ffdf reads in your data in chunks and puts that chunk to an ffdf. In transFUN you get one chunk in RAM based on which you can do data manipulations. It should return a data.frame which will be appended to your ffdf. So. This worked out fine for me. Jan require(ff) fwffile <- "/home/janw/Desktop/testdata.txt" info <- list() info$widths <- c(3,28,4,30,28,6,3,30,10,3,3,6,6,5,1,2,1,1,2,2,2,4,2,4,7,30,1,1,3,2,4,4) info$colnames <- paste("column", 1:length(info$widths)) x <- read.table.ffdf(file=fwffile, FUN="read.fwf", ## Change widths and col.names accordingly widths=info$widths, col.names = info$colnames, fileEncoding = "LATIN1", transFUN=function(x){ ## Handle fixed width spaces x$column.2 <- factor(gsub(" *$", "", x$column.2)) x$column.5 <- factor(gsub(" *$", "", x$column.5)) x }) class(x) str(as.data.frame(x)) 2013/8/8 <christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch>> Hi Jan, With the following column widths 3 28 4 30 28 6 3 30 10 3 3 6 6 5 1 2 1 1 2 2 2 4 2 4 7 30 1 1 3 2 4 4 I got this error: Error in `$<-.data.frame`(`*tmp*`, "b", value = integer(0)) : replacement has 0 rows, data has 1000 Best Regard Christian Kamenik Project Manager Federal Department of the Environment, Transport, Energy and Communications DETEC Federal Roads Office FEDRO Division Road Traffic Road Accident Statistics Mailing Address: 3003 Bern Location: Weltpoststrasse 5, 3015 Bern Tel +41 31 323 14 89 Fax +41 31 323 43 21 christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch> www.astra.admin.ch<http://www.astra.admin.ch/> Von: Jan Wijffels [mailto:jwijffels@bnosac.be<mailto:jwijffels@bnosac.be>] Gesendet: Donnerstag, 8. August 2013 11:15 An: Kamenik Christian ASTRA Betreff: Re: read.table.ffdf and fixed width files Hi Christian, The following worked out for me on your data. best, Jan require(ff) fwffile <- "/home/janw/Desktop/testdata.txt" x <- read.table.ffdf(file=fwffile, FUN="read.fwf", ## Change widths and col.names accordingly widths=c(3,28,4,30,281,142), col.names = c("a","b","c","e","f","g"), fileEncoding = "LATIN1", transFUN=function(x){ ## Handle fixed width spaces x$b <- factor(gsub(" *$", "", x$b)) x$e <- factor(gsub(" *$", "", x$e)) x }) class(x) str(as.data.frame(x)) 2013/8/8 <christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch>> Hello Jan Many thanks for your help! I attached part of the file. Best Regard Christian Kamenik Project Manager Federal Department of the Environment, Transport, Energy and Communications DETEC Federal Roads Office FEDRO Division Road Traffic Road Accident Statistics Mailing Address: 3003 Bern Location: Weltpoststrasse 5, 3015 Bern Tel +41 31 323 14 89 Fax +41 31 323 43 21 christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch> www.astra.admin.ch<http://www.astra.admin.ch/> Von: Jan Wijffels [mailto:jwijffels@bnosac.be<mailto:jwijffels@bnosac.be>] Gesendet: Donnerstag, 8. August 2013 08:53 An: Kamenik Christian ASTRA Betreff: Re: read.table.ffdf and fixed width files Hello Christian, If you need further help, it would probably be best to send a part of the file. So that we can check out if read.table.ffdf or LaF will both work. Jan 2013/8/8 <christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch>> Hi Jan Many thanks for your response. LaF is a great package, and I tried your suggestion using it. But I ran into another problem (see my recent post on "laf_open_fwf"). Best Regard Christian Kamenik Project Manager Federal Department of the Environment, Transport, Energy and Communications DETEC Federal Roads Office FEDRO Division Road Traffic Road Accident Statistics Mailing Address: 3003 Bern Location: Weltpoststrasse 5, 3015 Bern Tel +41 31 323 14 89 Fax +41 31 323 43 21 christian.kamenik@astra.admin.ch<mailto:christian.kamenik@astra.admin.ch> www.astra.admin.ch<http://www.astra.admin.ch/> Von: Jan Wijffels [mailto:jwijffels@bnosac.be<mailto:jwijffels@bnosac.be>] Gesendet: Mittwoch, 7. August 2013 10:28 An: Kamenik Christian ASTRA Betreff: read.table.ffdf and fixed width files Hi Christian, Regarding your question on R-help about 'read.table.ffdf and fixed width files'. Maybe this post is of interest to you http://r.789695.n4.nabble.com/Any-way-to-get-read-table-ffdf-in-the-ff-package-to-pass-colClasses-or-comment-char-parameters-throu-td4643171.html For my purposes, I regularly use the package LaF together with ffbase. If you set up the definition of your fixed width as a LaF object and use laf_to_ffdf from package ffbase, you have your dataset immediately in an ffdf. groeten/kind regards, Jan Jan Wijffels Statistical Data Miner www.bnosac.be<http://www.bnosac.be> | +32 486 611708 -- groeten/kind regards, Jan Jan Wijffels Statistical Data Miner www.bnosac.be<http://www.bnosac.be> | +32 486 611708 -- groeten/kind regards, Jan Jan Wijffels Statistical Data Miner www.bnosac.be<http://www.bnosac.be> | +32 486 611708 -- groeten/kind regards, Jan Jan Wijffels Statistical Data Miner www.bnosac.be<http://www.bnosac.be> | +32 486 611708 [[alternative HTML version deleted]]