emorway
2012-Jun-06 16:54 UTC
[R] extracting values from txt file that follow user-supplied quote
useRs-

I'm attempting to scan a text file of more than 1 GB and read and store the values that follow a specific key phrase repeated multiple times throughout the file. A snippet of the text file I'm trying to read is attached. The text file is a dumping ground for various aspects of the performance of the model that generates it. Thus, the information I want to extract is not in a fixed position (i.e. it does not always appear in a predictable location, like line 1000, or 2000, etc.). Rather, the desired values always follow a specific phrase: " PERCENT DISCREPANCY ="

One approach I took was the following:

library(R.utils)

txt_con <- file(description="D:/MCR_BeoPEST - Copy/MCR.out", open="r")
# The path above will need to be altered if one desires to test the code on
# the attached txt file, which will run much quicker
system.time(num_lines <- countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
# elapsed time on the full 1 GB file: about 55 seconds on a 3.6 GHz Xeon
num_lines
# 14405247

pd <- NULL  # the accumulator must exist before the loop appends to it
system.time(
  for (i in 1:num_lines) {
    txt_line <- readLines(txt_con, n=1)
    if (length(grep(" PERCENT DISCREPANCY =", txt_line))) {
      pd <- c(pd, as.numeric(substr(txt_line, 70, 78)))
    }
  }
)
close(txt_con)
# Took about 5 minutes

The inefficiency in this approach arises from reading the file twice: first to get num_lines, then to step through each line looking for the desired text.

Is there a way to speed this up through the use of ?scan? I wasn't able to get anything working, but what I had in mind was to scan through the file and, when the key phrase (e.g. " PERCENT DISCREPANCY =") is encountered, read and store the next 13 characters (which will include some white space) as a numeric value, then resume the scan until the key phrase is encountered again, repeating until the end-of-file marker is reached. Is such an approach even possible, or is line-by-line the best bet?

http://r.789695.n4.nabble.com/file/n4632558/MCR.out MCR.out
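For reference, the "read the next 13 characters after the key phrase" idea can be expressed line-wise with regexpr(). A minimal untested sketch, assuming txt already holds a vector of lines (e.g. one chunk from readLines()):

# Sketch of the "13 characters after the key phrase" idea, applied to a
# character vector 'txt' of lines already read into R:
key  <- " PERCENT DISCREPANCY ="
hits <- grep(key, txt, fixed = TRUE, value = TRUE)   # keep matching lines only
pos  <- regexpr(key, hits, fixed = TRUE)             # where the phrase starts
beg  <- pos + attr(pos, "match.length")              # first character after it
pd   <- as.numeric(substr(hits, beg, beg + 12L))     # the next 13 characters

This avoids hard-coding columns 70-78, so it still works if the value shifts position from line to line.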
Rainer Schuermann
2012-Jun-06 17:34 UTC
[R] extracting values from txt file that follow user-supplied quote
R may not be the best tool for this. Did you look at gawk? It is also available for Windows: http://gnuwin32.sourceforge.net/packages/gawk.htm

Once gawk has written a new file that contains only the lines / data you want, you could use R for the next steps. You can also run gawk from within R with the system() command.

Rgds,
Rainer

On Wednesday 06 June 2012 09:54:15 emorway wrote:
> [quoted text snipped]
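For example, a sketch of that route, assuming gawk is on the PATH and that the value occupies columns 70-78 as in the original post; the output can be read straight into R through a pipe rather than an intermediate file:

# Hypothetical one-liner: gawk prints columns 70-78 of each matching line,
# and R reads the result directly from the pipe as numbers.
pd <- as.numeric(readLines(pipe(
  'gawk "/PERCENT DISCREPANCY =/ { print substr($0, 70, 9) }" MCR.out')))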
Rui Barradas
2012-Jun-07 18:57 UTC
[R] extracting values from txt file that follow user-supplied quote
Hello,

I've just read your follow-up question on regular expressions, and I believe this, your original problem, can be made much faster. Just use readLines() differently, reading a large number of lines at a time. For this to work you will still need to know the total number of lines in the file.

fun <- function(con, pattern, nlines, n=5000L){
  if(is.character(con)){
    con <- file(con, open="rt")
    on.exit(close(con))
  }
  passes <- nlines %/% n
  remaining <- nlines %% n
  res <- NULL
  for(i in seq_len(passes)){
    txt <- readLines(con, n=n)
    res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
  }
  if(remaining){
    txt <- readLines(con, n=remaining)
    res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
  }
  res
}

url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
pat <- "PERCENT DISCREPANCY ="
num_lines <- 14405247L

# your original
txt_con <- file(description=url, open="r")
pd <- NULL
t1 <- system.time(
  for(i in 1:num_lines){
    txt_line <- readLines(txt_con, n=1)
    if (length(grep(pat, txt_line))) {
      pd <- c(pd, as.numeric(substr(txt_line, 70, 78)))
    }
  }
)
close(txt_con)

# the function above, with an increased 'n'
t2 <- system.time(pd2 <- fun(url, pat, num_lines, 100000L))

all.equal(pd, pd2)
[1] TRUE

rbind(original=t1, fun=t2, ratio=t1/t2)
          user.self sys.self  elapsed user.child sys.child
original     780.16   196.16 981.9100         NA        NA
fun            0.10     0.04   3.2000         NA        NA
ratio       7801.60  4904.00 306.8469         NA        NA

A factor of about 300 in elapsed time.

Hope this helps,

Rui Barradas

On 06-06-2012 17:54, emorway wrote:
> [quoted text snipped]
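One possible refinement of fun(), sketched here untested: readLines() simply returns fewer lines (and eventually zero) at end-of-file, so the nlines argument, and with it the countLines() pass over the file, can be dropped:

fun2 <- function(con, pattern, n=100000L){
  if(is.character(con)){
    con <- file(con, open="rt")
    on.exit(close(con))
  }
  res <- NULL
  repeat {
    txt <- readLines(con, n=n)
    if(length(txt) == 0L) break   # end of file: readLines returned nothing
    res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
  }
  res
}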
Gabor Grothendieck
2012-Jun-08 10:41 UTC
[R] extracting values from txt file that follow user-supplied quote
On Wed, Jun 6, 2012 at 12:54 PM, emorway <emorway at usgs.gov> wrote:
> [quoted text snipped]

Try this:

g <- function(url, string, from, to, ...) {
  L <- readLines(url)
  matched <- grep(string, L, value = TRUE, ...)
  as.numeric(substring(matched, from, to))
}

> url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
> g(url, "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)
[1]    NA  0.00 -0.01    NA  0.00 -0.01

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
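If the full 1 GB file is too large to pull into memory in a single readLines() call, the same function could be pointed at a pipe that pre-filters the file. A hypothetical sketch, assuming a command-line grep is available on the system (on Windows, e.g. via GnuWin32); the R-side grep() then simply matches every line it is given:

# Hypothetical: let an external grep pre-filter, so R reads matching lines only
pd <- g(pipe('grep "PERCENT DISCREPANCY =" "D:/MCR_BeoPEST - Copy/MCR.out"'),
        "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)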