Hi all,

I have a large file (1.8 GB) with 900,000 lines that I would like to read.
Each line is a character string. Specifically, I would like to randomly
select 3000 lines. For smaller files, what I'm doing is:

trs <- scan("myfile", what = character(), sep = "\n")
trs <- trs[sample(length(trs), 3000)]

This works OK; however, my computer does not seem able to handle the 1.8 GB
file. I thought of an alternative that does not require reading the whole
file:

sel <- sample(1:900000, 3000)
for (i in 1:3000) {
  un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
  write(un, "myfile_short", append = TRUE)
}

This also works on my computer; however, it is extremely slow, since it reads
one line at a time. It has been running for 25 hours and I think it has done
less than half of the file (yes, I probably do not have a very good computer,
and I'm working under Windows...).

So my question is: do you know any faster way to do this?

Thanks in advance

Juli

--
http://www.ceam.es/pausas
On 2/2/07, juli g. pausas <pausas at gmail.com> wrote:
> I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> Each line is a character string. Specifically, I would like to randomly
> select 3000 lines. [...]

Hi.

General idea:

1. Open your file as a connection, i.e. con <- file(pathname, open = "r").

2. Generate a row-to-(file offset, row length) map of your text file, i.e. a
numeric vector 'fileOffsets' and a numeric vector 'rowLengths'. Use readBin()
for this. You build the map up as you go by reading the file in chunks, which
means you can handle files of any size. You can save this lookup map to file
for future R sessions.

3. Sample a set of rows r = (r1, r2, ..., rR), i.e.
rows <- sample(length(fileOffsets), R).

4. Look up the file offsets and row lengths for these rows, i.e.
offsets <- fileOffsets[rows] and lengths <- rowLengths[rows].

5. If your subset of rows is not ordered, it is wise to order it first to
speed things up. If the original order is important, keep track of it and
re-order the results at the end.

6. For each row r, use seek(con = con, where = offsets[r]) to jump to the
start of the row, then use readBin(..., n = lengths[r]) to read the data.

7. Repeat from (3).

/Henrik
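A minimal, untested sketch of the steps above, assuming lines are separated
by single "\n" bytes (no "\r") and that the last line also ends with a
newline; the chunk size is an arbitrary placeholder:

pathname <- "myfile"

## (2) Build the row -> (file offset, row length) map by scanning the file
##     in chunks with readBin().
chunkSize <- 1e7  # bytes per chunk; adjust to available memory
con <- file(pathname, open = "rb")
newlinePos <- numeric(0)  # 1-based byte positions of every "\n"
bytesRead <- 0
repeat {
  bytes <- readBin(con, what = "raw", n = chunkSize)
  if (length(bytes) == 0) break
  newlinePos <- c(newlinePos, bytesRead + which(bytes == as.raw(10L)))
  bytesRead <- bytesRead + length(bytes)
}
close(con)

fileOffsets <- c(0, newlinePos[-length(newlinePos)])  # 0-based start of each row
rowLengths  <- diff(c(0, newlinePos))                 # row length incl. the "\n"

## (3)-(6) Sample rows, order them, then seek() to and read each one.
rows <- sort(sample(length(fileOffsets), 3000))
offsets <- fileOffsets[rows]
lengths <- rowLengths[rows]

con <- file(pathname, open = "rb")
out <- character(length(rows))
for (k in seq_along(rows)) {
  seek(con, where = offsets[k])
  out[k] <- sub("\n$", "", rawToChar(readBin(con, what = "raw", n = lengths[k])))
}
close(con)

write(out, "myfile_short")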
On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
> I thought of an alternative that does not require reading the whole file:
>
> sel <- sample(1:900000, 3000)
> for (i in 1:3000) {
>   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
>   write(un, "myfile_short", append = TRUE)
> }
>
> This works on my computer; however, it is extremely slow [...]

Juli,

I don't have a file to test this on, so caveat emptor.

The problem with the approach above is that you are re-reading the source
file, once per line, or 3000 times. In addition, each read is likely going
through half the file on average to locate the randomly selected line. Thus,
the reality is that you are probably reading on the order of:

> 3000 * 450000
[1] 1.35e+09

lines in the file, which of course is going to be quite slow. In addition,
you are also writing to the target file 3000 times.

The basic premise of the approach below is that you in effect create a
sequential file cache in an R object, reading large chunks of the source
file into the cache, then randomly selecting rows within the cache and
writing out the selected rows.

Thus, if you can read 100,000 rows at once, you would have 9 reads of the
source file and 9 writes of the target file.

The key thing here is to ensure that the offsets within the cache and the
corresponding random row values are properly aligned.

Here's the code:

# Generate the random values
sel <- sample(1:900000, 3000)

# Set up a sequence for the cache chunks.
# Presume you can read 100,000 rows at once
Cuts <- seq(0, 900000, 100000)

# Loop over the length of Cuts, less 1
for (i in seq(along = Cuts[-1])) {
  # Get a 100,000 row chunk, skipping rows
  # as appropriate for each subsequent chunk
  Chunk <- scan("myfile", what = character(), sep = "\n",
                skip = Cuts[i], nlines = 100000)

  # Set up a row sequence for the current chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Are any of the random values in the current chunk?
  Chunk.Sel <- sel[which(sel %in% Rows)]

  # If so, get them (indexing relative to the start of the chunk)
  if (length(Chunk.Sel) > 0) {
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

    # Now write them out
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}

As noted, I have not tested this, so there may yet be additional ways to save
time with file seeks, etc.

HTH,

Marc Schwartz
I had a file with 200,000 lines in it and it took 1 second to select 3000
sample lines out of it. The key is to use a connection, so that the file
stays open and each read just 'skips' forward to the next selected record:

> input <- file("/tempxx.txt", "r")
> sel <- 3000
> remaining <- 200000
> # get the record numbers to select
> recs <- sort(sample(1:remaining, sel))
> # compute the number of lines to skip before each read,
> # accounting for the record just read
> skip <- diff(c(0, recs)) - 1
> # allocate my data
> mysel <- vector('character', sel)
> system.time({
+     for (i in 1:sel) {
+         mysel[i] <- scan(input, what = "", sep = "\n", skip = skip[i],
+                          n = 1, quiet = TRUE)
+     }
+ })
[1] 0.97 0.02 1.00 NA NA

On 2/2/07, juli g. pausas <pausas at gmail.com> wrote:
> I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> Each line is a character string. Specifically, I would like to randomly
> select 3000 lines. [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
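For the original problem, a self-contained, untested adaptation of the above
(900,000 lines, with the "myfile" and "myfile_short" names from the first
post), including the final write step:

input <- file("myfile", "r")
recs <- sort(sample(1:900000, 3000))   # record numbers to select
skip <- diff(c(0, recs)) - 1           # lines to skip before each read
mysel <- vector("character", length(recs))
for (i in seq_along(recs)) {
  mysel[i] <- scan(input, what = "", sep = "\n", skip = skip[i],
                   n = 1, quiet = TRUE)
}
close(input)
writeLines(mysel, "myfile_short")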
Thanks so much for your help and comments.

The approach proposed by Jim Holtman was the simplest and fastest. The
approach by Marc Schwartz also worked (after a very small modification).

It is clear that a good knowledge of R saves a lot of time! I've been able to
do in a few minutes a process that was only a quarter done after 25 hours!

Many thanks

Juli

--
http://www.ceam.es/pausas
On Sat, 2007-02-03 at 19:06 +0100, juli g. pausas wrote:
> Thanks so much for your help and comments.
>
> The approach proposed by Jim Holtman was the simplest and fastest. The
> approach by Marc Schwartz also worked (after a very small modification).
> [...]

Juli,

Just out of curiosity, what change did you make? Also, what were the running
times for the solutions?

Regards,

Marc