emorway
2012-Jun-06 16:54 UTC
[R] extracting values from txt file that follow user-supplied quote
useRs-

I'm attempting to scan a text file of more than 1 GB and read and store the values that follow a specific key phrase repeated multiple times throughout the file. A snippet of the text file I'm trying to read is attached. The text file is a dumping ground for various aspects of the performance of the model that generates it. Thus, the information I want to extract is not in a fixed position (i.e. it does not always appear in a predictable location, like line 1000, or 2000, etc.). Rather, the desired values always follow a specific phrase: " PERCENT DISCREPANCY ="

One approach I took was the following:

library(R.utils)

txt_con <- file(description="D:/MCR_BeoPEST - Copy/MCR.out", open="r")
# The path above will need to be altered if one desires to test the code on
# the attached txt file, which will run much quicker
system.time(num_lines <- countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
# elapsed time on the full 1 GB file: about 55 seconds on a 3.6 GHz Xeon
num_lines
# 14405247

pd <- NULL  # the accumulator must exist before the loop appends to it
system.time(
  for (i in 1:num_lines) {
    txt_line <- readLines(txt_con, n=1)
    if (length(grep(" PERCENT DISCREPANCY =", txt_line))) {
      pd <- c(pd, as.numeric(substr(txt_line, 70, 78)))
    }
  }
)
close(txt_con)
# Took about 5 minutes

The inefficiency in this approach arises from reading the file twice: first to get num_lines, then to step through each line looking for the desired text.

Is there a way to speed this up through the use of ?scan? I wasn't able to get anything working, but what I had in mind was to scan through the file and, when the key phrase (e.g. " PERCENT DISCREPANCY =") is encountered, read and store the next 13 characters (which will include some white space) as a numeric value, then resume the scan until the key phrase is encountered again, repeating until the end-of-file marker is reached. Is such an approach even possible, or is line-by-line the best bet?

http://r.789695.n4.nabble.com/file/n4632558/MCR.out MCR.out
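For reference, the "read the next 13 characters after the key phrase" idea can be expressed line-wise with regexpr(). A minimal untested sketch, assuming txt already holds a vector of lines (e.g. one chunk from readLines()):

# Sketch of the "13 characters after the key phrase" idea, applied to a
# character vector 'txt' of lines already read into R:
key  <- " PERCENT DISCREPANCY ="
hits <- grep(key, txt, fixed = TRUE, value = TRUE)   # keep matching lines only
pos  <- regexpr(key, hits, fixed = TRUE)             # where the phrase starts
beg  <- pos + attr(pos, "match.length")              # first character after it
pd   <- as.numeric(substr(hits, beg, beg + 12L))     # the next 13 characters

This avoids hard-coding columns 70-78, so it still works if the value shifts position from line to line.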
Rainer Schuermann
2012-Jun-06 17:34 UTC
[R] extracting values from txt file that follow user-supplied quote
R may not be the best tool for this. Did you look at gawk? It is also available for Windows: http://gnuwin32.sourceforge.net/packages/gawk.htm

Once gawk has written a new file that contains only the lines / data you want, you could use R for the next steps. You can also run gawk from within R with the system() command.

Rgds,
Rainer

On Wednesday 06 June 2012 09:54:15 emorway wrote:
> [quoted text snipped]
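For example, a sketch of that route, assuming gawk is on the PATH and that the value occupies columns 70-78 as in the original post; the output can be read straight into R through a pipe rather than an intermediate file:

# Hypothetical one-liner: gawk prints columns 70-78 of each matching line,
# and R reads the result directly from the pipe as numbers.
pd <- as.numeric(readLines(pipe(
  'gawk "/PERCENT DISCREPANCY =/ { print substr($0, 70, 9) }" MCR.out')))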
Rui Barradas
2012-Jun-07 18:57 UTC
[R] extracting values from txt file that follow user-supplied quote
Hello,

I've just read your follow-up question on regular expressions, and I believe this, your original problem, can be made much faster. Just use readLines() differently, reading a large number of lines at a time. For this to work you will still need to know the total number of lines in the file.

fun <- function(con, pattern, nlines, n=5000L){
  if(is.character(con)){
    con <- file(con, open="rt")
    on.exit(close(con))
  }
  passes <- nlines %/% n
  remaining <- nlines %% n
  res <- NULL
  for(i in seq_len(passes)){
    txt <- readLines(con, n=n)
    res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
  }
  if(remaining){
    txt <- readLines(con, n=remaining)
    res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
  }
  res
}

url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
pat <- "PERCENT DISCREPANCY ="
num_lines <- 14405247L

# your original
txt_con <- file(description=url, open="r")
pd <- NULL
t1 <- system.time(
  for(i in 1:num_lines){
    txt_line <- readLines(txt_con, n=1)
    if (length(grep(pat, txt_line))) {
      pd <- c(pd, as.numeric(substr(txt_line, 70, 78)))
    }
  }
)
close(txt_con)

# the function above, with an increased 'n'
t2 <- system.time(pd2 <- fun(url, pat, num_lines, 100000L))

all.equal(pd, pd2)
[1] TRUE

rbind(original=t1, fun=t2, ratio=t1/t2)
          user.self sys.self  elapsed user.child sys.child
original     780.16   196.16 981.9100         NA        NA
fun            0.10     0.04   3.2000         NA        NA
ratio       7801.60  4904.00 306.8469         NA        NA

A factor of about 300 in elapsed time.

Hope this helps,

Rui Barradas

On 06-06-2012 17:54, emorway wrote:
> [quoted text snipped]
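One possible refinement of fun(), sketched here untested: readLines() simply returns fewer lines (and eventually zero) at end-of-file, so the nlines argument, and with it the countLines() pass over the file, can be dropped:

fun2 <- function(con, pattern, n=100000L){
  if(is.character(con)){
    con <- file(con, open="rt")
    on.exit(close(con))
  }
  res <- NULL
  repeat {
    txt <- readLines(con, n=n)
    if(length(txt) == 0L) break   # end of file: readLines returned nothing
    res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
  }
  res
}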
Gabor Grothendieck
2012-Jun-08 10:41 UTC
[R] extracting values from txt file that follow user-supplied quote
On Wed, Jun 6, 2012 at 12:54 PM, emorway <emorway at usgs.gov> wrote:
> [quoted text snipped]

Try this:

g <- function(url, string, from, to, ...) {
  L <- readLines(url)
  matched <- grep(string, L, value = TRUE, ...)
  as.numeric(substring(matched, from, to))
}

> url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
> g(url, "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)
[1]    NA  0.00 -0.01    NA  0.00 -0.01

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
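If the full 1 GB file is too large to pull into memory in a single readLines() call, the same function could be pointed at a pipe that pre-filters the file. A hypothetical sketch, assuming a command-line grep is available on the system (on Windows, e.g. via GnuWin32); the R-side grep() then simply matches every line it is given:

# Hypothetical: let an external grep pre-filter, so R reads matching lines only
pd <- g(pipe('grep "PERCENT DISCREPANCY =" "D:/MCR_BeoPEST - Copy/MCR.out"'),
        "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)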