Dear R users,

I have a big text file formatted like this:

x x_string
y y_string
id1 id1_string
id2 id2_string
z z_string
w w_string
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string1
y y_string1
z z_string1
w w_string1
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string2
y y_string2
id1 id1_string1
id2 id2_string1
z z_string2
w w_string2
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
...

I'd like to parse this file, retrieve the x, y, id1, id2, z, w fields, and save them into a matrix object:

x          y          id1          id2          z          w
x_string   y_string   id1_string   id2_string   z_string   w_string
x_string1  y_string1  NA           NA           z_string1  w_string1
x_string2  y_string2  id1_string1  id2_string1  z_string2  w_string2
...

The id1 and id2 fields are not always present within a section (the interval between x and the last stuff), and I'd like to insert an NA when they are absent (see above), so that length(x) == length(y) == length(id1) == ...

Without the id1, id2 fields the task is easily solvable by importing the text file with readLines and retrieving the individual fields with grep:

input = readLines("file.txt")
x = grep("^x\\s", input, value = TRUE)
id1 = grep("^id1\\s", input, value = TRUE)
...

I'd like to accomplish this task entirely in R (no SQL, no Perl script), possibly without using loops.

Any suggestions are quite welcome!

Regards,
Paolo
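One loop-free way to get the NA fill described above is to number the records with cumsum and look each key up per record with match(), which returns NA for absent keys. A minimal sketch (not from the thread; the sample data is inlined for illustration):

```r
## Toy input in the format described above; "//" delimits records.
input <- c("x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
           "z z_string", "w w_string", "stuff stuff stuff", "//",
           "x x_string1", "y y_string1", "z z_string1", "w w_string1", "//")
keys <- c("x", "y", "id1", "id2", "z", "w")

rec <- cumsum(c(0, head(input == "//", -1)))  # record index for each line
fld <- sub("\\s.*", "", input)                # first token: field name
val <- sub("^\\S+\\s+", "", input)            # remainder: field value

## match() yields NA for keys absent from a record, producing the NA fill
m <- t(sapply(split(seq_along(input), rec),
              function(i) val[i][match(keys, fld[i])]))
colnames(m) <- keys
```

Here m is a character matrix with one row per record and NA where id1/id2 are missing.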
This should do what you want. (It uses loops; you can work at replacing those with 'lapply' and such -- it all depends on whether it is going to take you more time to rewrite the code than to process a set of data; you never did say how large the data was.) This also "grows" a data.frame, but you have not indicated how efficient it has to be. So this could be used as a model.

> x <- readLines(textConnection("x x_string
+ y y_string
+ id1 id1_string
+ id2 id2_string
+ z z_string
+ w w_string
+ stuff stuff stuff
+ stuff stuff stuff
+ stuff stuff stuff
+ //
+ x x_string1
+ y y_string1
+ z z_string1
+ w w_string1
+ stuff stuff stuff
+ stuff stuff stuff
+ stuff stuff stuff
+ //
+ x x_string2
+ y y_string2
+ id1 id1_string1
+ id2 id2_string1
+ z z_string2
+ w w_string2
+ stuff stuff stuff
+ stuff stuff stuff
+ stuff stuff stuff
+ //"))
> # I assume that each group is delimited by "//"
> # initialize a data.frame with the desired fields
> .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, z=NA, w=NA)
> .out <- .keys   # for the first pass
> .save <- NULL
> for (i in seq_along(x)) {
+     if (x[i] == "//") {   # output the current record
+         .save <- rbind(.save, .out)
+         .out <- .keys     # set up for the next pass
+     } else {
+         .split <- strsplit(x[i], "\\s+")
+         if (.split[[1]][1] %in% names(.out)) {
+             .out[[.split[[1]][1]]] <- .split[[1]][2]
+         }
+     }
+ }
> .save
          x         y         id1         id2         z         w
1  x_string  y_string  id1_string  id2_string  z_string  w_string
2 x_string1 y_string1        <NA>        <NA> z_string1 w_string1
3 x_string2 y_string2 id1_string1 id2_string1 z_string2 w_string2

On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego at gmail.com> wrote:
> Dear R users,
>
> I have a big text file formatted like this: [...]
>
> I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields
> and save them into a matrix object [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
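As Jim notes, the code above "grows" a data.frame, and rbind inside a loop copies the accumulated rows on every pass. If that becomes slow, the usual remedy is to count the records first and preallocate; a sketch of that idea (not from the thread, toy data inlined):

```r
## Toy input in the thread's format; "//" ends each record.
x <- c("x x_string", "y y_string", "//",
       "x x_string1", "//")
keys <- c("x", "y")

n <- sum(x == "//")                          # one "//" per record
out <- as.data.frame(matrix(NA_character_, n, length(keys),
                            dimnames = list(NULL, keys)),
                     stringsAsFactors = FALSE)
r <- 1
for (line in x) {
    if (line == "//") {
        r <- r + 1                           # advance to the next record
    } else {
        sp <- strsplit(line, "\\s+")[[1]]
        if (sp[1] %in% keys) out[r, sp[1]] <- sp[2]
    }
}
```

Filling rows of a preallocated data.frame avoids the repeated copying, at the cost of one extra pass over the input to count the separators.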
Thank you, Martin! This code is amazing! So fast! Exactly what I was looking for! Parsing ~8M lines (~600MB file size) took about 45s on a Xeon 3.4 GHz (8 GB). Thank you so much!

Sincerely,
Paolo

Martin Morgan wrote:
> Paolo Sonego <paolo.sonego a gmail.com> writes:
>
>> I apologize for giving wrong information again ... :-[
>> The number of files is not a problem (30/40). The real deal is that
>> some of my files have ~10^6 lines (file size ~300/400MB) :'(
>> Thanks again for your help and advice!
>
> If memory is not an issue, then this might be reasonably performant...
>
> process_chunk <- function(txt, rec_sep, keys)
> {
>     ## filter
>     keep_regex <- paste("^(",
>                         paste(rec_sep, keys, sep="|", collapse="|"),
>                         ")", sep="")
>     txt <- txt[grep(keep_regex, txt)]
>
>     ## construct key/value pairs
>     splt <- strsplit(txt, "\\W+")
>     val <- unlist(lapply(splt, "[", 2))
>     names(val) <- unlist(lapply(splt, "[", 1))
>
>     ## break key/value into records
>     ends <- c(grep(rec_sep, txt), length(txt))
>     grps <- rep(seq_along(ends), c(ends[1], diff(ends)))
>     recs <- split(val, grps)
>
>     ## reformat as matrix
>     sapply(keys, function(key, recs) {
>         res <- sapply(recs, "[", key)
>         names(res) <- NULL
>         res
>     }, recs=recs)
> }
>
> > rec <- "//"
> > keys <- c("x", "y", "z", "w", "id1", "id2")
> > process_chunk(readLines("/tmp/tmp.txt"), rec, keys)
>      x           y           z           w           id1
> [1,] "x_string"  "y_string"  "z_string"  "w_string"  "id1_string"
> [2,] "x_string1" "y_string1" "z_string1" "w_string1" NA
> [3,] "x_string2" "y_string2" "z_string2" "w_string2" "id1_string1"
>      id2
> [1,] "id2_string"
> [2,] NA
> [3,] "id2_string1"
>
> This took about 130s and no more than 250MB to process your data
> replicated to about 5M lines (~80MB file size).
>
> I haven't really tested the following, but this might also be useful
> for processing in chunks:
>
> process <- function(filename,
>                     rec_sep="//",
>                     keys=c("x", "y", "z", "w", "id1", "id2"),
>                     chunk_size = 10^6)
> {
>     result <- NULL
>     resid <- character(0)
>     con <- file(filename, "r")
>     while (length(txt <- readLines(con, chunk_size)) != 0) {
>         recs <- grep(rec_sep, txt)
>         if (length(recs) > 0) {
>             maxrec <- max(recs)
>             if (maxrec == length(txt)) buf <- character(0)
>             else buf <- txt[(maxrec+1):length(txt)]
>             txt <- c(resid, txt[-(maxrec:length(txt))])
>             resid <- buf
>         } else {
>             txt <- c(resid, txt)
>             resid <- character(0)
>         }
>         result <-
>             rbind(result,
>                   process_chunk(txt, rec_sep=rec_sep, keys=keys))
>     }
>     close(con)
>     if (length(resid) != 0) {
>         result <-
>             rbind(result,
>                   process_chunk(resid, rec_sep=rec_sep, keys=keys))
>     }
>     result
> }
>
> > process('/tmp/tmp.txt', chunk_size=10L)  # make size much larger
>      x           y           z           w           id1
> [1,] "x_string"  "y_string"  "z_string"  "w_string"  "id1_string"
> [2,] "x_string1" "y_string1" "z_string1" "w_string1" NA
> [3,] "x_string2" "y_string2" "z_string2" "w_string2" "id1_string1"
>      id2
> [1,] "id2_string"
> [2,] NA
> [3,] "id2_string1"
>
>> Regards,
>> Paolo
>>
>> jim holtman wrote:
>>> How much time is it taking on the files and how many files do you have
>>> to process? I tried it with your data duplicated so that I had 57K
>>> lines and it took 27 seconds to process. How much faster do you want?
>>
>> ______________________________________________
>> R-help a r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
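The least obvious step in Martin's process_chunk is the record-grouping trick: rep() over diff(ends) labels each filtered line with its record number, so split() can collect the records. A toy run (values hypothetical, for illustration only) makes it concrete:

```r
## Already-filtered lines, as they would look after the grep step
txt  <- c("x a", "y b", "//", "x c", "//")
ends <- c(grep("//", txt), length(txt))               # separator positions: 3, 5, 5
grps <- rep(seq_along(ends), c(ends[1], diff(ends)))  # 1 1 1 2 2 (group 3 is empty)
recs <- split(seq_along(txt), grps)                   # list(`1` = 1:3, `2` = 4:5)
```

Appending length(txt) to ends is what lets a trailing record without a final "//" still form its own group; when the last line is a separator, the extra group is simply empty.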