Waichler, Scott R
2014-Aug-09 22:31 UTC
[R] Reading chunks of data from a file more efficiently
Hi,

I have some very large (~1.1 GB) output files from a groundwater model called STOMP that I want to read as efficiently as possible. For each variable there are over 1 million values to read. Variables are not organized in columns; instead they are written out in sections in the file, like this:

X-Direction Node Positions, m
 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05
 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05
. . .
 5.946950000E+05 5.947050000E+05 5.947150000E+05 5.947250000E+05
 5.947950000E+05 5.948050000E+05 5.948150000E+05 5.948250000E+05

Y-Direction Node Positions, m
 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05
 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05
. . .
 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05
 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05

Z-Direction Node Positions, m
 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01
 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01
. . .

I want to read and use only a subset of the variables. I wrote the function below to find the line where each target variable begins and then scan the values, but it still seems rather slow, perhaps because I am opening and closing the file for each variable. Can anyone suggest a faster way?

# Reads original STOMP plot file (plot.*) directly. Should be useful when the plot files are
# very large with lots of variables, and you just want to retrieve a few of them.
# Arguments: 1) plot filename, 2) number of nodes,
# 3) character vector of names of target variables you want to return.
# Returns a list with the selected plot output.
READ.PLOT.OUTPUT6 <- function(plt.file, num.nodes, var.names) {
  lines <- readLines(plt.file)
  num.vars <- length(var.names)
  tmp <- list()
  for(i in 1:num.vars) {
    ind <- grep(var.names[i], lines, fixed=T, useBytes=T)
    if(length(ind) != 1) stop("Not one line in the plot file with matching variable name.\n")
    tmp[[i]] <- scan(plt.file, skip=ind, nmax=num.nodes, quiet=T)
  }
  return(tmp)
} # end READ.PLOT.OUTPUT6()

Regards,
Scott Waichler
Pacific Northwest National Laboratory
Richland, WA, USA
scott.waichler at pnnl.gov
Jeff Newmiller
2014-Aug-10 01:14 UTC
[R] Reading chunks of data from a file more efficiently
Informally abbreviating data is not recommended... I faked some, but would appreciate it if you would make your example reproducible next time.

All I really did for performance was use the data you read in rather than re-scanning the file.

# generated by using dput()
lines <- c("X-Direction Node Positions, m",
  " 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05",
  " 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05",
  " 5.946950000E+05 5.947050000E+05 5.947150000E+05 5.947250000E+05",
  " 5.947950000E+05 5.948050000E+05 5.948150000E+05 5.948250000E+05",
  "",
  "Y-Direction Node Positions, m",
  " 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
  " 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
  " 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
  " 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
  "",
  "Z-Direction Node Positions, m",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  "",
  "X-Direction Node Positions, n",
  " 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05",
  " 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05",
  " 5.946950000E+05 5.947050000E+05 5.947150000E+05 5.947250000E+05",
  " 5.947950000E+05 5.948050000E+05 5.948150000E+05 5.948250000E+05",
  "",
  "Y-Direction Node Positions, n",
  " 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
  " 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
  " 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
  " 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
  "",
  "Z-Direction Node Positions, n",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  "",
  "")

# lines:        character vector holding the whole file
# Dim:          "X", "Y" or "Z"
# specifiedvar: variable name that follows the comma in the header line
# starts:       indices of all header lines in 'lines'
getDimVar <- function( lines, Dim, specifiedvar, starts ) {
  # locate the header line for this dimension and variable
  vstart <- grep( paste0( "^", Dim, "-Direction Node Positions, "
                        , specifiedvar, "$" ), lines )
  startv <- match( vstart, starts )
  if ( 0 == length( startv ) ) {
    stop( "Variable ", specifiedvar, " not found" )
  }
  # the section runs up to the line before the next header,
  # or to the end of the data for the last section
  if ( length( starts ) == startv ) {
    vend <- length( lines )
  } else {
    vend <- starts[ startv + 1 ] - 1
  }
  # parse the section from memory instead of re-reading the file
  tcon <- textConnection( lines[ seq( vstart + 1, vend ) ] )
  result <- scan( tcon )
  close( tcon )
  result
}

starts <- grep( "^[XYZ]-Direction Node Positions, ", lines )

specifiedvar <- "n"
n <- data.frame( X=getDimVar( lines, "X", specifiedvar, starts )
               , Y=getDimVar( lines, "Y", specifiedvar, starts )
               , Z=getDimVar( lines, "Z", specifiedvar, starts ) )

# test a variable that doesn't exist (this stops with an error)
specifiedvar <- "o"
o <- data.frame( X=getDimVar( lines, "X", specifiedvar, starts )
               , Y=getDimVar( lines, "Y", specifiedvar, starts )
               , Z=getDimVar( lines, "Z", specifiedvar, starts ) )

---------------------------------------------------------------------------
Jeff Newmiller
DCN:<jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
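The advice in the reply above (parse the lines already held in memory rather than re-scanning the file) can also be applied to the interface of the original function. The following is only a sketch under stated assumptions: the name read_plot_in_memory is hypothetical and not from the thread, each data line is assumed to hold at least one value, and each variable name is assumed to match exactly one header line. It has not been timed against a real STOMP file.

```r
# Hypothetical helper (not from the thread): read the file once with
# readLines(), then parse every requested section from the in-memory lines.
read_plot_in_memory <- function(plt.file, num.nodes, var.names) {
  lines <- readLines(plt.file)
  out <- vector("list", length(var.names))
  names(out) <- var.names
  for (i in seq_along(var.names)) {
    ind <- grep(var.names[i], lines, fixed = TRUE, useBytes = TRUE)
    if (length(ind) != 1) {
      stop("Expected exactly one header line matching '", var.names[i],
           "', found ", length(ind))
    }
    if (ind >= length(lines)) {
      stop("No data lines after header '", var.names[i], "'")
    }
    # Assumption: every data line carries at least one value, so num.nodes
    # lines after the header are always enough to hold num.nodes values.
    last <- min(ind + num.nodes, length(lines))
    # nmax makes scan() stop after num.nodes values, before the next header.
    out[[i]] <- scan(text = lines[(ind + 1):last], nmax = num.nodes,
                     quiet = TRUE)
    if (length(out[[i]]) != num.nodes) {
      stop("Fewer than num.nodes values found for '", var.names[i], "'")
    }
  }
  out
}

# Small demonstration on a temporary file built from values in the post.
demo.lines <- c(
  "X-Direction Node Positions, m",
  " 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05",
  " 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05",
  "",
  "Z-Direction Node Positions, m",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
  ""
)
demo.file <- tempfile()
writeLines(demo.lines, demo.file)
res <- read_plot_in_memory(demo.file, num.nodes = 8,
                           var.names = c("X-Direction Node Positions, m",
                                         "Z-Direction Node Positions, m"))
unlink(demo.file)
```

For sections with far fewer lines than num.nodes, the slice passed to scan() is longer than needed; if the number of values per line is known (four in the sample), the slice could be shortened accordingly.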
peter salzman
2014-Aug-12 01:34 UTC
[R] Reading chunks of data from a file more efficiently
Scott,

there is a package called ff that '... provides data structures that are stored on disk but behave (almost) as if they were in RAM ...'

I hope it helps.

Peter

--
Peter Salzman, PhD
Department of Biostatistics and Computational Biology
University of Rochester
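As a rough illustration of what the ff suggestion looks like in practice, the sketch below creates a file-backed vector and reads back a slice. It assumes the ff package is installed and skips itself otherwise; the vector length and the variable names are made up for the example, and nothing here shows how the STOMP file itself would be parsed into the ff object.

```r
# Sketch only: requires the 'ff' package; does nothing if it is missing.
first.four <- NULL
if (requireNamespace("ff", quietly = TRUE)) {
  n <- 1000L                                  # made-up length for the demo
  x <- ff::ff(vmode = "double", length = n)   # vector stored in a file on disk
  # Assign a few values taken from the sample data in the original post.
  x[1:4] <- c(5.931450000E+05, 5.931550000E+05,
              5.931650000E+05, 5.931750000E+05)
  first.four <- x[1:4]                        # only this slice is read into RAM
  ff::delete(x)                               # remove the backing file
  rm(x)
}
```

Whether this helps with reading speed is a separate question from memory use: the values still have to be parsed from the text file once before they can be stored on disk.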