thr3ads.net - R help - [R] Reading chunks of data from a file more efficiently [Aug 2014]

Waichler, Scott R

2014-Aug-09 22:31 UTC

[R] Reading chunks of data from a file more efficiently

Hi,

I have some very large (~1.1 GB) output files from a groundwater model called
STOMP that I want to read as efficiently as possible.  For each variable there
are over 1 million values to read.  Variables are not organized in columns;
instead they are written out in sections in the file, like this:

X-Direction Node Positions, m
 5.931450000E+05  5.931550000E+05  5.931650000E+05  5.931750000E+05
 5.932450000E+05  5.932550000E+05  5.932650000E+05  5.932750000E+05
. . . 
 5.946950000E+05  5.947050000E+05  5.947150000E+05  5.947250000E+05
 5.947950000E+05  5.948050000E+05  5.948150000E+05  5.948250000E+05

Y-Direction Node Positions, m
 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05
 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05
. . . 
 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05
 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05

Z-Direction Node Positions, m
 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01
 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01
. . .

I want to read and use only a subset of the variables.  I wrote the function
below to find the line where each target variable begins and then scan the
values, but it still seems rather slow, perhaps because I am opening and closing
the file for each variable.  Can anyone suggest a faster way?

# Reads original STOMP plot file (plot.*) directly.  Should be useful when the
# plot files are very large with lots of variables, and you just want to
# retrieve a few of them.
# Arguments:  1) plot filename, 2) number of nodes, 
# 3) character vector of names of target variables you want to return.
# Returns a list with the selected plot output.
READ.PLOT.OUTPUT6 <- function(plt.file, num.nodes, var.names) {
  lines <- readLines(plt.file)
  num.vars <- length(var.names)
  tmp <- list()
  for(i in 1:num.vars) {
    ind <- grep(var.names[i], lines, fixed=TRUE, useBytes=TRUE)
    if(length(ind) != 1) stop("Not one line in the plot file with matching variable name.\n")
    tmp[[i]] <- scan(plt.file, skip=ind, nmax=num.nodes, quiet=TRUE)
  }
  return(tmp)
}  # end READ.PLOT.OUTPUT6()

Regards,
Scott Waichler
Pacific Northwest National Laboratory
Richland, WA, USA
scott.waichler at pnnl.gov
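
One way to address the open-and-close cost raised in the question is to read the file through a single open connection, a block of lines at a time, so the whole 1.1 GB file never has to sit in memory. The following is only a sketch under stated assumptions: the function name `read_sections_streaming` and the `chunk.lines` argument are hypothetical, header lines are assumed to start with a letter while data lines do not, and `var.names` is assumed to hold the exact header text.

```r
# Hypothetical helper: stream a sectioned text file through ONE open connection.
# Assumptions (not taken from STOMP documentation): every header line starts
# with a letter, data lines do not, and var.names holds the exact header text.
read_sections_streaming <- function(path, var.names, chunk.lines = 50000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- setNames(vector("list", length(var.names)), var.names)
  current <- NA_character_  # target section currently being collected, if any

  repeat {
    chunk <- readLines(con, n = chunk.lines)   # continues from the last position
    if (length(chunk) == 0L) break
    hdr <- grep("^[A-Za-z]", chunk)            # header lines inside this chunk
    starts <- c(1L, hdr)                       # first segment continues the previous chunk
    ends <- c(hdr - 1L, length(chunk))
    for (k in seq_along(starts)) {
      s <- starts[k]
      if (k > 1L) {                            # this segment opens with a header line
        name <- trimws(chunk[s])
        current <- if (name %in% var.names) name else NA_character_
        s <- s + 1L
      }
      if (!is.na(current) && s <= ends[k]) {
        kept[[current]] <- c(kept[[current]], chunk[s:ends[k]])
      }
    }
  }

  not.found <- var.names[vapply(kept, is.null, logical(1))]
  if (length(not.found) > 0L) {
    stop("Section(s) not found: ", paste(not.found, collapse = "; "))
  }
  # Parse the collected text lines of each target section into numeric vectors.
  lapply(kept, function(txt) scan(text = txt, quiet = TRUE))
}

# Small made-up file in the same layout as the excerpt above.
demo.file <- tempfile(fileext = ".txt")
writeLines(c("X-Direction Node Positions, m",
             " 1.0E+00  2.0E+00  3.0E+00  4.0E+00",
             " 5.0E+00  6.0E+00  7.0E+00  8.0E+00",
             "",
             "Y-Direction Node Positions, m",
             " 9.0E+00  9.0E+00  9.0E+00  9.0E+00",
             "",
             "Z-Direction Node Positions, m",
             " 1.5E+01  2.5E+01  3.5E+01  4.5E+01"), demo.file)
# A tiny chunk size forces sections to straddle chunk boundaries.
res <- read_sections_streaming(demo.file,
                               c("X-Direction Node Positions, m",
                                 "Z-Direction Node Positions, m"),
                               chunk.lines = 2L)
unlink(demo.file)
```

Only the text of the requested sections is kept, so memory use should scale with the selected variables rather than with the file. Whether this is actually faster than a single readLines() call on a given machine would need to be timed.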

Jeff Newmiller

2014-Aug-10 01:14 UTC

Informally abbreviating data is not recommended... I faked some, but would 
appreciate if you would make your example reproducible next time.

All I really did for performance was use the data you read in rather than 
re-scanning the file.

# generated by using dput()
lines <- c("X-Direction Node Positions, m",
" 5.931450000E+05  5.931550000E+05  5.931650000E+05  5.931750000E+05",
" 5.932450000E+05  5.932550000E+05  5.932650000E+05  5.932750000E+05",
" 5.946950000E+05  5.947050000E+05  5.947150000E+05  5.947250000E+05",
" 5.947950000E+05  5.948050000E+05  5.948150000E+05  5.948250000E+05",
"",
"Y-Direction Node Positions, m",
" 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05",
" 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05",
" 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05",
" 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05",
"",
"Z-Direction Node Positions, m",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
"",
"X-Direction Node Positions, n",
" 5.931450000E+05  5.931550000E+05  5.931650000E+05  5.931750000E+05",
" 5.932450000E+05  5.932550000E+05  5.932650000E+05  5.932750000E+05",
" 5.946950000E+05  5.947050000E+05  5.947150000E+05  5.947250000E+05",
" 5.947950000E+05  5.948050000E+05  5.948150000E+05  5.948250000E+05",
"",
"Y-Direction Node Positions, n",
" 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05",
" 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05",
" 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05",
" 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05",
"",
"Z-Direction Node Positions, n",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
" 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01",
"", "")

getDimVar <- function( lines, Dim, specifiedvar, starts ) {
   vstart <- grep( paste0( "^", Dim, "-Direction Node Positions, "
                         , specifiedvar, "$" ), lines )
   startv <- match( vstart, starts )
   if ( 0 == length( startv ) ) {
     stop( "Variable ", specifiedvar, " not found" )
   }
   if ( length( starts ) == startv ) {
     vend <- length( lines )
   } else {
     vend <- starts[ startv + 1 ] - 1
   }
   tcon <- textConnection( lines[ seq( vstart + 1, vend ) ] )
   result <- scan( tcon )
   close( tcon )
   result
}

starts <- grep( "^[XYZ]-Direction Node Positions, ", lines )

specifiedvar <- "n"
n <- data.frame( X=getDimVar( lines, "X", specifiedvar, starts )
                , Y=getDimVar( lines, "Y", specifiedvar, starts )
                , Z=getDimVar( lines, "Z", specifiedvar, starts ) )

# test a variable that doesn't exist
specifiedvar <- "o"
o <- data.frame( X=getDimVar( lines, "X", specifiedvar, starts )
                , Y=getDimVar( lines, "Y", specifiedvar, starts )
                , Z=getDimVar( lines, "Z", specifiedvar, starts ) )
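
The same idea (parse from the lines already in memory instead of re-scanning the file) can be sketched for an arbitrary list of section names, which is closer to the interface of the original READ.PLOT.OUTPUT6. This is only an illustration under assumptions: the name `getVars` is hypothetical, every header line is assumed to start with a letter, and `var.names` is assumed to give the exact header text.

```r
# Hypothetical generalization of the approach above: headers may be any text
# line, not only the "<Dim>-Direction Node Positions" pattern.
# Assumes header lines start with a letter and data lines do not.
getVars <- function(lines, var.names) {
  starts <- grep("^[A-Za-z]", lines)           # every section header line
  ends <- c(starts[-1L] - 1L, length(lines))   # last line of each section
  hit <- match(var.names, trimws(lines[starts]))
  if (anyNA(hit)) {
    stop("Variable(s) not found: ",
         paste(var.names[is.na(hit)], collapse = "; "))
  }
  out <- lapply(hit, function(h) {
    # Lines strictly after the header, up to the line before the next header.
    body <- lines[starts[h] + seq_len(ends[h] - starts[h])]
    scan(text = body, quiet = TRUE)
  })
  names(out) <- var.names
  out
}

# Made-up data in the same layout as the faked example.
demo.lines <- c("X-Direction Node Positions, m",
                " 1.0E+00  2.0E+00  3.0E+00  4.0E+00",
                "",
                "Z-Direction Node Positions, m",
                " 1.5E+01  2.5E+01  3.5E+01  4.5E+01",
                "",
                "X-Direction Node Positions, n",
                " 5.0E+00  6.0E+00  7.0E+00  8.0E+00",
                "")
vars <- getVars(demo.lines, c("Z-Direction Node Positions, m",
                              "X-Direction Node Positions, n"))
```

With a real file, `lines` would come from a single readLines() call, so the file is read once no matter how many variables are requested.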


---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

peter salzman

2014-Aug-12 01:34 UTC

Scott,

There is a package called ff that '... provides data structures that are
stored on disk but behave (almost) as if they were in RAM ...'

I hope it helps.
peter


-- 
Peter Salzman, PhD
Department of Biostatistics and Computational Biology
University of Rochester
