thr3ads.net - R help - [R] How to read data sequentially into R (line by line)? [Oct 2011]

johannes rara

2011-Oct-18 12:12 UTC

[R] How to read data sequentially into R (line by line)?

I have a data set like this in one .txt file (cols separated by !):

APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!

It contains over 14 000 000 records. Because I run out of memory
when trying to handle this data in R, I'm trying to read it
sequentially, write it out to several .csv files (or .RData files),
and then read these into R one by one. One record in this data is
the block of lines between two GG!KK!KK! lines. I tried to implement Jim
Holtman's approach
(http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but the
problem is how to avoid cutting a record in the middle. I mean
that if I put nrows = 1000000, I don't know whether one record (between
two GG!KK!KK! marks) ends up split across two files. How can I avoid
that? My code so far:

zz <- file("myfile.txt", "r")
fileNo <- 1
repeat{

    gotError <- 1 # set to 2 if there is an error
    # catch the error if there is no more data
    tryCatch(input <- read.csv(zz, as.is=TRUE, nrows=1000000, sep='!',
                               row.names=NULL, na.strings="", header=FALSE),
              error=function(x) gotError <<- 2)

    if (gotError == 2) break
    # save the intermediate data
    save(input, file=sprintf("file%03d.RData", fileNo))
    fileNo <- fileNo + 1
}
close(zz)

jim holtman

2011-Oct-18 12:50 UTC


Let's do it in two parts: first create all the separate files (if
that is all you are after, we can stop here).  You can change the
value of n in readLines to read in as many lines as you want per pass;
I set it to 100 here just for testing.

x <- textConnection("APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!")

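fileNo <- 1  # used for file name
buffer <- character(0)  # lines read but not yet written out
repeat{
    input <- readLines(x, n = 100)  # lines to read per pass
    if (length(input) == 0) break  # done
    buffer <- c(buffer, input)
    # write out every complete record currently in the buffer
    repeat{
        indx <- grep("^GG!KK!KK!", buffer)[1]  # first separator line
        if (is.na(indx)) break  # not found yet; read more
        # seq_len() selects nothing if the separator is the first line
        writeLines(buffer[seq_len(indx - 1L)]
            , sprintf("newFile%04d", fileNo)
            )
        buffer <- buffer[-seq_len(indx)]  # remove record and its separator
        fileNo <- fileNo + 1
    }
}
close(x)
# lines still in 'buffer' here belong to a last record with no closing separator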
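
The loop above writes one file per record. If the goal is instead a smaller number of files that each hold many whole records, the same buffering idea can be adapted. What follows is only a rough sketch under assumptions of mine: the helper name split_records, its arguments, the "chunk" file prefix and the use of tempdir() are all made up for illustration, and separator lines are kept in the output so records stay delimited. The tiny values (2 records per file, 5 lines per read) are for testing only.

```r
# Sample data: three records, each closed by a GG!KK!KK! line
rec <- c("APE!KKU!684!", "APE!VAL!!", "APE!UASU!!", "APE!PLA!1!",
         "APE!E!10!", "APE!TPVA!17122009!", "APE!STAP!1!", "GG!KK!KK!")
con <- textConnection(rep(rec, 3))

# Hypothetical helper (not from the thread): write files that each
# contain 'records_per_file' complete records, never splitting a record.
split_records <- function(con, records_per_file = 2L, lines_per_read = 5L,
                          prefix = file.path(tempdir(), "chunk"),
                          sep_pattern = "^GG!KK!KK!") {
    buffer <- character(0)  # lines read but not yet written
    files <- character(0)   # names of the files written so far
    repeat {
        input <- readLines(con, n = lines_per_read)
        at_eof <- length(input) == 0L
        buffer <- c(buffer, input)
        repeat {
            seps <- grep(sep_pattern, buffer)  # separator positions
            if (length(seps) >= records_per_file) {
                cut <- seps[records_per_file]  # end of the last whole record
            } else if (at_eof && length(buffer) > 0L) {
                cut <- length(buffer)          # flush the remainder at EOF
            } else break                       # need more lines
            out <- sprintf("%s%04d.txt", prefix, length(files) + 1L)
            writeLines(buffer[seq_len(cut)], out)
            files <- c(files, out)
            buffer <- buffer[-seq_len(cut)]
        }
        if (at_eof) break
    }
    files
}

files <- split_records(con)
close(con)

# Second part: read the pieces back one at a time.
# Assumes "GG" appears in the first column only on separator lines.
for (f in files) {
    dat <- read.table(f, sep = "!", header = FALSE, na.strings = "",
                      colClasses = "character", quote = "",
                      comment.char = "", fill = TRUE)
    is_sep <- dat$V1 == "GG"
    # a new record starts on the line after each separator
    dat$record <- cumsum(c(FALSE, head(is_sep, -1L))) + 1L
    dat <- dat[!is_sep, ]
    cat(basename(f), ":", length(unique(dat$record)), "records\n")
}
```

For the real file you would presumably pass file("myfile.txt", "r") as the connection and use much larger values, e.g. lines_per_read = 100000. Note that the inner loop rescans the whole buffer with grep on each pass, which is fine for a sketch but may be worth tightening for very large chunks.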



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
