This looks possible in R, and your algorithm looks precise enough. The help pages for file, readLines, and scan should shed some light.

For jobs like this I tend to use Perl, however. Familiarity is one reason: I'm more comfortable with Perl for scanning/parsing files. Also, Perl was originally written for exactly this sort of thing.

Cheers

Jason
--
Indigo Industrial Controls Ltd.
64-21-343-545
jasont at indigoindustrial.co.nz
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Dear R-Help,

I have a generated file that looks like the following:

----- Begin file -----
 #
 # Output File
 #
 float Version 2002.700000000000
 int Numdays 31
 int NumOFEs 1
 #
 # Hillslope-specific variables
 #
 char HillVarNames[ 3 ]
 {Days In Simulation}
 {Hillslope: Precipitation (mm)}
 {Hillslope: Average detachment (kg/m**2)}
 #
 # OFE-specific variables
 #
 char OFEVarNames[ 3 ]
 {Irrigation depth (mm)}
 {Irrigation_volume_supplied/unit_area (mm)}
 {Runoff (mm)}
 #
 # Daily values:
 #
 1 5.40000 0.00000 0.00000 0.00000 0.00000
 2 0.00000 0.00000 0.00000 0.00000 0.00000
 3 2.30000 0.00000 0.00000 0.00000 0.00000
 4 0.00000 0.00000 0.00000 0.00000 0.00000
 5 0.00000 0.00000 0.00000 0.00000 0.00000
 6 0.00000 0.00000 0.00000 0.00000 0.00000
 7 0.00000 0.00000 0.00000 0.00000 0.00000
 8 0.00000 0.00000 0.00000 0.00000 0.00000
 9 12.80000 0.00000 0.00000 4.57200 0.00000
 10 0.00000 0.00000 0.00000 0.00000 0.00000
 11 0.00000 0.00000 0.00000 0.00000 0.00000
 12 0.00000 0.00000 0.00000 0.00000 0.00000
 13 0.00000 0.00000 0.00000 0.00000 0.00000
 14 0.00000 0.00000 0.00000 0.00000 0.00000
 15 0.00000 0.00000 0.00000 0.00000 0.00000
 16 0.00000 0.00000 0.00000 0.00000 0.00000
 17 0.00000 0.00000 0.00000 0.00000 0.00000
 18 0.00000 0.00000 0.00000 0.00000 0.00000
 19 0.00000 0.00000 0.00000 0.00000 0.00000
 20 0.00000 0.00000 0.00000 0.00000 0.00000
 21 0.00000 0.00000 0.00000 0.00000 0.00000
 22 0.00000 0.00000 0.00000 0.00000 0.00000
 23 0.00000 0.00000 0.00000 0.00000 0.00000
 24 0.00000 0.00000 0.00000 0.00000 0.00000
 25 0.00000 0.00000 0.00000 0.00000 0.00000
 26 0.00000 0.00000 0.00000 0.00000 0.00000
 27 0.00000 0.00000 0.00000 0.00000 0.00000
 28 0.00000 0.00000 0.00000 0.00000 0.00000
 29 32.30000 0.00001 0.00001 4.57200 0.00000
 30 0.00000 0.00000 0.00000 0.00000 0.00000
 31 0.00000 0.00000 0.00000 0.00000 0.00000
 #
 # Minimum/Maximum values:
 #
 1 0.00000 0.00000 0.00000 0.00000 0.00000
 63 32.30000 0.00001 0.00001 4.57200 0.00000
----- end file -----

Note: Spaces in the first column are real.

I would like to read in a data.frame containing only the data between:

" # # Daily values: #" and " # # Minimum/Maximum values: #"

but the number of columns in the dataset will vary. The information describing how it varies is contained in the sections:

" char HillVarNames[ 3 ] {Days In Simulation} {Hillslope: Precipitation (mm)} {Hillslope: Average detachment (kg/m**2)}"

and

" char OFEVarNames[ 3 ] {Irrigation depth (mm)} {Irrigation_volume_supplied/unit_area (mm)} {Runoff (mm)}"

The number of columns is the sum of HillVarNames and OFEVarNames (6), and the column labels are listed below each of those lines. Depending on options in the model run which generates this file, the number of columns can change. But I would like to write a function that reads the file and makes a data.frame with two columns, day and runoff, in this case columns 1 and 6 in the file.

If I can parse the variable names into a vector I can determine which element has {Days In Simulation} and {Runoff (mm)}, but I am having trouble finding a function that will allow me to read in parts of the file and use information gathered along the way to direct additional reading. The procedure I envision will look like this:

 (1) skip first 9 lines
 (2) read 3rd word in next line and assign to variable hillvarnames
 (3) read hillvarnames more lines
 (4) test which line has the value {Days In Simulation} and assign index to daycolumn
 (5) skip 3 lines
 (6) read 3rd word in next line and assign to variable ofevarnames
 (7) read ofevarnames more lines
 (8) test which line has the value {Runoff (mm)} and assign index+hillvarnames to runoffcolumn
 (9) skip 3 lines
(10) read lines until 5 lines remain and assign the values in the daycolumn and runoffcolumn columns to a data.frame with columns day and runoff

Is this a reasonable thing to do in R? Are there some functions that will make this task less difficult? Is there a function that allows you to read a small amount of information, parse it, test it, and then begin reading again where it left off?

I am using the following R version:

         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    1
minor    6.1
year     2002
month    11
day      01
language R

Thank you in advance.

With best wishes and kind regards I am

Sincerely,

Corey A. Moffet
Support Scientist
University of Idaho
Northwest Watershed Research Center
800 Park Blvd, Plaza IV, Suite 105
Boise, ID 83712-7716
(208) 422-0718
ripley@stats.ox.ac.uk
2002-Nov-19 19:11 UTC
[R] help reading a variably formatted text file
On Tue, 19 Nov 2002, Corey Moffet wrote:

> Is this a reasonable thing to do in R? Are there some functions that
> will make this task less difficult? Is there a function that allows you to
> read a small amount of information, parse it, test it, and then begin reading
> again where it left off?

That's what connections and pushbacks are for.

?connection
?pushBack

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
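
For readers who want to see what the connection approach might look like, here is a minimal sketch, not taken from the thread. It assumes the file layout shown in the original question, a recent version of R (trimws and the text argument of read.table are not available in R 1.6.1), and label strings that match exactly after trimming. The name read_hill_con and its helpers are hypothetical. An open connection remembers its position between calls to readLines, and pushBack returns a line that was read too far.

```r
## Sketch only: read_hill_con() and its helpers are hypothetical names.
read_hill_con <- function(path) {
  con <- file(path, open = "r")      # text-mode connection keeps its position
  on.exit(close(con))

  next_line <- function() readLines(con, n = 1)

  ## Read forward to the next " char Name[ n ]" header, then read n labels.
  read_names <- function() {
    repeat {
      header <- next_line()
      if (length(header) == 0) stop("no 'char' header found")
      if (grepl("^[[:space:]]*char[[:space:]]", header)) break
    }
    n <- as.integer(sub("^.*\\[[[:space:]]*([0-9]+)[[:space:]]*\\].*$",
                        "\\1", header))
    trimws(readLines(con, n = n))
  }

  ## Hillslope names first, then OFE names, in file order.
  labels <- c(read_names(), read_names())
  day_col <- which(labels == "{Days In Simulation}")
  runoff_col <- which(labels == "{Runoff (mm)}")

  ## Skip the comment lines before the data; the first data row is read
  ## one step too far, so push it back onto the connection.
  repeat {
    line <- next_line()
    if (length(line) == 0) stop("no data rows found")
    if (!grepl("^[[:space:]]*#", line)) {
      pushBack(line, con)
      break
    }
  }

  ## Collect data rows until the next comment line (the Min/Max section).
  rows <- character(0)
  repeat {
    line <- next_line()
    if (length(line) == 0 || grepl("^[[:space:]]*#", line)) break
    rows <- c(rows, line)
  }

  values <- read.table(text = rows, header = FALSE)
  data.frame(day = values[[day_col]], runoff = values[[runoff_col]])
}
```

Calling read_hill_con("hillslope.dat") on a file like the sample should give a two-column data frame of day and runoff. Searching for the "char" headers and the "#" lines, rather than skipping fixed line counts, is a design choice of this sketch; the fixed counts from the original ten-step procedure would work equally well with readLines(con, n = ...).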
On Tue, 19 Nov 2002, Corey Moffet stated:

> Dear R-Help,
>
> I have a generated file that looks like the following:
> ....
> Is this a reasonable thing to do in R? Are there some functions that will
> make this task less difficult? Is there a function that allows you to read
> a small amount of information, parse it, test it, and then begin reading
> again where it left off?

This function seems to work, on your sample file at least,

read.hill <- function (file) {
    ## Read the whole file, one element per line
    lines <- scan (file, what = "", sep = "\n", quiet = TRUE)
    ## Get the line starting with ' char'
    chars <- grep ("^ char", lines)
    ## Get the number of columns (one count per ' char' line)
    ncols <- unlist (get.numbers (lines[chars]))
    ## Get the column labels
    labels <- lines[rep (chars, ncols) +
                    unlist (lapply (ncols, seq, from = 1), use.names = FALSE)]
    ## Locate the two columns of interest among the labels
    days.col <- grep ("Days", labels)
    runoff.col <- grep ("Runoff", labels)
    ## Get the numbers
    toSkip <- grep ("Daily values", lines) + 1
    toRead <- grep ("Minimum/Maximum", lines) - 2 - toSkip
    temp <- unlist (strsplit (lines[(toSkip+1):(toSkip+toRead)],
                              split = " +"))
    ## There are some "" at the first column
    temp <- matrix (temp, ncol = length (labels) + 1, byrow = TRUE)
    data.frame (days = as.numeric (temp[, days.col + 1]),
                runoff = as.numeric (temp[, runoff.col + 1]))
}

get.numbers () is a function that I wrote to extract numbers from a character vector that match a certain pattern.

get.numbers <- function (ss, pattern, ignore.case = FALSE) {
    ## Optionally keep only the strings that match 'pattern'
    if (!missing (pattern)) {
        ss <- grep (pattern, x = ss, ignore.case = ignore.case,
                    value = TRUE)
    }
    if (length (ss) == 0) {
        return (NULL)
    }
    ## split at non numeric, non-dot characters and two or more dots
    ## FIXME: this is not the optimal split
    token <- strsplit (ss, split = "([^-+.0-9]|--+|\\+\\++|\\.\\.+| \t)")
    ## remove any trailing '.'
    token <- lapply (token, function (x) sub ("\\.$", "", x))
    ## remove empty strings and convert to numeric
    token <- lapply (token, function (x) {
        as.numeric (x[sapply (x, function (y) y != "")])
    })
    ## Name each result after its input string
    if (is.null (names (ss))) {
        names (token) <- ss
    } else {
        names (token) <- names (ss)
    }
    token
}

As a test:

> read.hill ("hillslope.dat")
   days runoff
1     1      0
2     2      0
3     3      0
4     4      0
5     5      0
6     6      0
7     7      0
8     8      0
9     9      0
10   10      0
11   11      0
12   12      0
13   13      0
14   14      0
15   15      0
16   16      0
17   17      0
18   18      0
19   19      0
20   20      0
21   21      0
22   22      0
23   23      0
24   24      0
25   25      0
26   26      0
27   27      0
28   28      0
29   29      0
30   30      0
31   31      0

As Jason pointed out, Perl might be more suitable for this job. However, I do like using R to parse many weird files. I find maintaining R scripts much easier than Perl, and it is often more convenient to read a file directly into R.

It would be nice to have more powerful regex in R, such as returning matched substring grouped with "()".

Michael

--
----------------------------------------------------------------------------
Michael Na Li
Email: lina at u.washington.edu
Department of Biostatistics, Box 357232
University of Washington, Seattle, WA 98195
---------------------------------------------------------------------------
ripley@stats.ox.ac.uk
2002-Nov-19 21:41 UTC
[R] help reading a variably formatted text file
On Tue, 19 Nov 2002, Michael Na Li wrote:

> It would be nice to have more powerful regex in R, such as returning matched
> substring grouped with "()".

I think you are overlooking the power of gsub. You can certainly do that.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
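
As an illustration of the gsub point, here is a small sketch that is not from the thread. It uses backreferences ("\\1", "\\2") in the replacement to return only the parenthesised groups from the two header lines in the sample file. The pattern and the variable names are assumptions chosen for this example. The pattern is anchored at both ends so that the whole string is replaced by the captured group.

```r
## Header lines copied from the sample file in the original question.
header <- c(" char HillVarNames[ 3 ]", " char OFEVarNames[ 3 ]")

## Group 1 captures the variable name, group 2 the count in brackets.
pattern <- "^ *char +([A-Za-z]+)\\[ *([0-9]+) *\\] *$"

## Replacing the whole match with one group leaves just that substring.
var_name  <- gsub(pattern, "\\1", header)
var_count <- as.integer(gsub(pattern, "\\2", header))

## The same idea strips the braces from a label line.
label <- gsub("^ *\\{(.*)\\} *$", "\\1", " {Runoff (mm)}")
```

One caveat: when the pattern does not match, gsub returns the input string unchanged, so it can be worth checking with grep first.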
Hi,

| Date: Tue, 19 Nov 2002 11:11:41 -0700
| From: Corey Moffet <cmoffet at nwrc.ars.usda.gov>
|
| I have a generated file that looks like the following:
|
| ----- Begin file -----
| #
| # Output File
| #
| float Version 2002.700000000000

As I understand it, you generated the file yourself using different software. In that case I strongly recommend considering XML as the format of the data file. It allows much more flexible parsing, and the changes needed in the reading routines are simple if you change the format of the data. Many programs have ready-made XML parsers, among them R (package XML) and Perl.

I have used XML with success while transferring complicated estimation results from SAS and GAUSS to R.

Just a suggestion.

Ott