Dear R users,

I have a big text file formatted like this:

x x_string
y y_string
id1 id1_string
id2 id2_string
z z_string
w w_string
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string1
y y_string1
z z_string1
w w_string1
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string2
y y_string2
id1 id1_string1
id2 id2_string1
z z_string2
w w_string2
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
...

I'd like to parse this file, retrieve the x, y, id1, id2, z, w fields, and save them into a matrix object:

x          y          id1          id2          z          w
x_string   y_string   id1_string   id2_string   z_string   w_string
x_string1  y_string1  NA           NA           z_string1  w_string1
x_string2  y_string2  id1_string1  id2_string1  z_string2  w_string2
...

The id1 and id2 fields are not always present within a section (the interval between x and the last stuff), and I'd like to insert an NA when they are absent (see above), so that length(x) == length(y) == length(id1) == ...

Without the id1, id2 fields the task is easily solvable by importing the text file with readLines and retrieving the individual fields with grep:

input = readLines("file.txt")
x = grep("^x\\s", input, value = TRUE)
id1 = grep("^id1\\s", input, value = TRUE)
...

I'd like to accomplish this task entirely in R (no SQL, no Perl script), possibly without using loops.

Any suggestions are quite welcome!

Regards,
Paolo
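One loop-free way to get the NA fill described above is to number the records with cumsum and look each key up per record with match(), which returns NA for absent keys. A minimal sketch (not from the thread; the sample data is inlined for illustration):

```r
## Toy input in the format described above; "//" delimits records.
input <- c("x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
           "z z_string", "w w_string", "stuff stuff stuff", "//",
           "x x_string1", "y y_string1", "z z_string1", "w w_string1", "//")
keys <- c("x", "y", "id1", "id2", "z", "w")

rec <- cumsum(c(0, head(input == "//", -1)))  # record index for each line
fld <- sub("\\s.*", "", input)                # first token: field name
val <- sub("^\\S+\\s+", "", input)            # remainder: field value

## match() yields NA for keys absent from a record, producing the NA fill
m <- t(sapply(split(seq_along(input), rec),
              function(i) val[i][match(keys, fld[i])]))
colnames(m) <- keys
```

Here m is a character matrix with one row per record and NA where id1/id2 are missing.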
This should do what you want. (It uses loops; you can work at replacing those with 'lapply' and such -- it all depends on whether it is going to take you more time to rewrite the code than to process a set of data; you never did say how large the data was.) This also "grows" a data.frame, but you have not indicated how efficient it has to be. So this could be used as a model.

> x <- readLines(textConnection("x x_string
+ y y_string
+ id1 id1_string
+ id2 id2_string
+ z z_string
+ w w_string
+ stuff stuff stuff
+ stuff stuff stuff
+ stuff stuff stuff
+ //
+ x x_string1
+ y y_string1
+ z z_string1
+ w w_string1
+ stuff stuff stuff
+ stuff stuff stuff
+ stuff stuff stuff
+ //
+ x x_string2
+ y y_string2
+ id1 id1_string1
+ id2 id2_string1
+ z z_string2
+ w w_string2
+ stuff stuff stuff
+ stuff stuff stuff
+ stuff stuff stuff
+ //"))
> # I assume that each group is delimited by "//"
> # initialize a data.frame with the desired fields
> .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, z=NA, w=NA)
> .out <- .keys   # for the first pass
> .save <- NULL
> for (i in seq_along(x)) {
+     if (x[i] == "//") {   # output the current record
+         .save <- rbind(.save, .out)
+         .out <- .keys     # set up for the next pass
+     } else {
+         .split <- strsplit(x[i], "\\s+")
+         if (.split[[1]][1] %in% names(.out)) {
+             .out[[.split[[1]][1]]] <- .split[[1]][2]
+         }
+     }
+ }
> .save
          x         y         id1         id2         z         w
1  x_string  y_string  id1_string  id2_string  z_string  w_string
2 x_string1 y_string1        <NA>        <NA> z_string1 w_string1
3 x_string2 y_string2 id1_string1 id2_string1 z_string2 w_string2

On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego at gmail.com> wrote:
> Dear R users,
>
> I have a big text file formatted like this: [...]
>
> I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields
> and save them into a matrix object [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
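As Jim notes, the code above "grows" a data.frame, and rbind inside a loop copies the accumulated rows on every pass. If that becomes slow, the usual remedy is to count the records first and preallocate; a sketch of that idea (not from the thread, toy data inlined):

```r
## Toy input in the thread's format; "//" ends each record.
x <- c("x x_string", "y y_string", "//",
       "x x_string1", "//")
keys <- c("x", "y")

n <- sum(x == "//")                          # one "//" per record
out <- as.data.frame(matrix(NA_character_, n, length(keys),
                            dimnames = list(NULL, keys)),
                     stringsAsFactors = FALSE)
r <- 1
for (line in x) {
    if (line == "//") {
        r <- r + 1                           # advance to the next record
    } else {
        sp <- strsplit(line, "\\s+")[[1]]
        if (sp[1] %in% keys) out[r, sp[1]] <- sp[2]
    }
}
```

Filling rows of a preallocated data.frame avoids the repeated copying, at the cost of one extra pass over the input to count the separators.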
Thank you, Martin! This code is amazing! So fast! Exactly what I was looking for! Parsing ~8M lines (~600MB file size) took about 45s on a Xeon 3.4 GHz (8 GB). Thank you so much!

Sincerely,
Paolo

Martin Morgan wrote:
> Paolo Sonego <paolo.sonego a gmail.com> writes:
>
>> I apologize for giving wrong information again ... :-[
>> The number of files is not a problem (30/40). The real deal is that
>> some of my files have ~10^6 lines (file size ~300/400MB) :'(
>> Thanks again for your help and advice!
>
> If memory is not an issue, then this might be reasonably performant...
>
> process_chunk <- function(txt, rec_sep, keys)
> {
>     ## filter
>     keep_regex <- paste("^(",
>                         paste(rec_sep, keys, sep="|", collapse="|"),
>                         ")", sep="")
>     txt <- txt[grep(keep_regex, txt)]
>
>     ## construct key/value pairs
>     splt <- strsplit(txt, "\\W+")
>     val <- unlist(lapply(splt, "[", 2))
>     names(val) <- unlist(lapply(splt, "[", 1))
>
>     ## break key/value into records
>     ends <- c(grep(rec_sep, txt), length(txt))
>     grps <- rep(seq_along(ends), c(ends[1], diff(ends)))
>     recs <- split(val, grps)
>
>     ## reformat as matrix
>     sapply(keys, function(key, recs) {
>         res <- sapply(recs, "[", key)
>         names(res) <- NULL
>         res
>     }, recs=recs)
> }
>
> > rec <- "//"
> > keys <- c("x", "y", "z", "w", "id1", "id2")
> > process_chunk(readLines("/tmp/tmp.txt"), rec, keys)
>      x           y           z           w           id1
> [1,] "x_string"  "y_string"  "z_string"  "w_string"  "id1_string"
> [2,] "x_string1" "y_string1" "z_string1" "w_string1" NA
> [3,] "x_string2" "y_string2" "z_string2" "w_string2" "id1_string1"
>      id2
> [1,] "id2_string"
> [2,] NA
> [3,] "id2_string1"
>
> This took about 130s and no more than 250MB to process your data
> replicated to about 5M lines (~80MB file size).
>
> I haven't really tested the following, but this might also be useful
> for processing in chunks:
>
> process <- function(filename,
>                     rec_sep="//",
>                     keys=c("x", "y", "z", "w", "id1", "id2"),
>                     chunk_size = 10^6)
> {
>     result <- NULL
>     resid <- character(0)
>     con <- file(filename, "r")
>     while (length(txt <- readLines(con, chunk_size)) != 0) {
>         recs <- grep(rec_sep, txt)
>         if (length(recs) > 0) {
>             maxrec <- max(recs)
>             if (maxrec == length(txt)) buf <- character(0)
>             else buf <- txt[(maxrec+1):length(txt)]
>             txt <- c(resid, txt[-(maxrec:length(txt))])
>             resid <- buf
>         } else {
>             txt <- c(resid, txt)
>             resid <- character(0)
>         }
>         result <-
>             rbind(result,
>                   process_chunk(txt, rec_sep=rec_sep, keys=keys))
>     }
>     close(con)
>     if (length(resid) != 0) {
>         result <-
>             rbind(result,
>                   process_chunk(resid, rec_sep=rec_sep, keys=keys))
>     }
>     result
> }
>
> > process('/tmp/tmp.txt', chunk_size=10L)  # make size much larger
>      x           y           z           w           id1
> [1,] "x_string"  "y_string"  "z_string"  "w_string"  "id1_string"
> [2,] "x_string1" "y_string1" "z_string1" "w_string1" NA
> [3,] "x_string2" "y_string2" "z_string2" "w_string2" "id1_string1"
>      id2
> [1,] "id2_string"
> [2,] NA
> [3,] "id2_string1"
>
>> Regards,
>> Paolo
>>
>> jim holtman wrote:
>>> How much time is it taking on the files and how many files do you have
>>> to process? I tried it with your data duplicated so that I had 57K
>>> lines and it took 27 seconds to process. How much faster do you want?
>>
>> ______________________________________________
>> R-help a r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
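The least obvious step in Martin's process_chunk is the record-grouping trick: rep() over diff(ends) labels each filtered line with its record number, so split() can collect the records. A toy run (values hypothetical, for illustration only) makes it concrete:

```r
## Already-filtered lines, as they would look after the grep step
txt  <- c("x a", "y b", "//", "x c", "//")
ends <- c(grep("//", txt), length(txt))               # separator positions: 3, 5, 5
grps <- rep(seq_along(ends), c(ends[1], diff(ends)))  # 1 1 1 2 2 (group 3 is empty)
recs <- split(seq_along(txt), grps)                   # list(`1` = 1:3, `2` = 4:5)
```

Appending length(txt) to ends is what lets a trailing record without a final "//" still form its own group; when the last line is a separator, the extra group is simply empty.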