I'm wondering if anyone has written some functions or code for handling very large files in R. I am working with a data file that is 41 variables times who knows how many observations, making up 27MB altogether.

The sort of thing that I am thinking of having R do is:

- count the number of lines in a file

- form a data frame by selecting all cases whose line numbers are in a supplied vector (which could be used to extract random subfiles of particular sizes)

Does anyone know of a package that might be useful for this?

Murray

--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                               Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
Hi,

Have you looked at "R Data Import/Export"?

On Mon, 25 Aug 2003, Murray Jorgensen wrote:
> [...]

Cheers,
Kevin

------------------------------------------------------------------------------
"On two occasions, I have been asked [by members of Parliament], 'Pray,
Mr. Babbage, if you put into the machine wrong figures, will the right
answers come out?' I am not able to rightly apprehend the kind of confusion
of ideas that could provoke such a question."
-- Charles Babbage (1791-1871)
     ---- From Computer Stupidities: http://rinkworks.com/stupid/

--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599 x88475 (City) x88480 (Tamaki)
Could you be more specific? Do you mean the chapter on connections?

Ko-Kang Kevin Wang wrote:
> Hi,
>
> Have you looked at "R Data Import/Export"?
Dear Murray,

One way that works very well for many people (including me) is to store the data in an external database, such as MySQL, and read in just the bits you want using the excellent package RODBC. Getting a database to do all the selecting is very fast and efficient, leaving R to concentrate on the analysis and visualisation. This is all described in the R Import/Export Manual.

Regards,

Andrew C. Ward
CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au

Quoting Murray Jorgensen <maj at stats.waikato.ac.nz>:
> [...]
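For concreteness, a minimal sketch of the workflow Andrew describes, not taken from his post: it assumes an ODBC data source called "trafficDB" with a table "flows" that has an integer key column "row_id"; all of these names are illustrative only.

    ## Pull a random subset of rows out of the database rather than into R.
    library(RODBC)
    ch  <- odbcConnect("trafficDB")            # open the ODBC channel (DSN name is hypothetical)
    ids <- sample(1:230175, 3000)              # row keys we want this time
    qry <- paste("SELECT * FROM flows WHERE row_id IN (",
                 paste(ids, collapse = ","), ")")
    sub <- sqlQuery(ch, qry)                   # comes back as a data frame
    odbcClose(ch)

The point of the design is that the filtering happens in the database engine, so R only ever allocates memory for the 3000 selected cases.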
Murray Jorgensen <maj at stats.waikato.ac.nz> wrote:
	I'm wondering if anyone has written some functions or code for handling
	very large files in R. I am working with a data file that is 41
	variables times who knows how many observations making up 27MB altogether.

Does that really count as "very large"? I tried making a file where each line was "1 2 3 .... 39 40 41". With 240,000 lines it came to 27.36 million bytes.

You can *hold* that amount of data in R quite easily. The problem is the time it takes to read it using scan() or read.table().

	The sort of thing that I am thinking of having R do is

	- count the number of lines in a file

	- form a data frame by selecting all cases whose line numbers are in a
	supplied vector (which could be used to extract random subfiles of
	particular sizes)

	Does anyone know of a package that might be useful for this?

There's a Unix program I posted to comp.sources years ago called "sample":

    sample -(how many) <(where from)

selects the given number of lines without replacement from its standard input and writes them in random order to its standard output. Hook it up to a decent random number generator and you're pretty much done: read.table() and scan() can read from a pipe.
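To illustrate that last point, a small sketch (not part of the original post): it assumes a Unix-like system, that the "sample" filter described above (or any equivalent) is on the PATH, and the file name is made up.

    ## read.table() reads happily from a pipe() connection, so the random
    ## subsetting can be delegated to the external filter.
    sub <- read.table(pipe("sample -3000 < bigfile.txt"), header = FALSE)
    dim(sub)   # should be 3000 x 41 for the file described in this thread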
I think that is only a medium-sized file.

On Mon, 25 Aug 2003, Murray Jorgensen wrote:

> I'm wondering if anyone has written some functions or code for handling
> very large files in R. I am working with a data file that is 41
> variables times who knows how many observations making up 27MB altogether.
>
> The sort of thing that I am thinking of having R do is
>
> - count the number of lines in a file

You can do that without reading the file into memory: use

    system(paste("wc -l", filename))

or read in blocks of lines via a connection.

> - form a data frame by selecting all cases whose line numbers are in a
> supplied vector (which could be used to extract random subfiles of
> particular sizes)

R should handle that easily in today's memory sizes. Buy some more RAM if you don't already have 1/2Gb. As others have said, for a really large file, use an RDBMS to do the selection for you.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
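For Windows users without wc, the second suggestion (reading blocks of lines through a connection) might be sketched as follows; this is not from the original post, and the function name and block size are arbitrary.

    ## Count lines without ever holding the whole file in memory: an open
    ## connection is read sequentially, so each readLines() call picks up
    ## where the previous one stopped.
    countLines <- function(filename, block = 10000) {
      con <- file(filename, open = "r")
      on.exit(close(con))
      n <- 0
      repeat {
        chunk <- readLines(con, n = block)
        if (length(chunk) == 0) break   # end of file
        n <- n + length(chunk)
      }
      n
    }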
At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
>I think that is only a medium-sized file.

"Large" for my purposes means "more than I really want to read into memory", which in turn means "takes more than 30s". I'm at home now and the file isn't, so I'm not sure if the file is large or not.

More responses interspersed below. BTW, I forgot to mention that I'm using Windows and so do not have nice unix tools readily available.

>On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>
>> I'm wondering if anyone has written some functions or code for handling
>> very large files in R. I am working with a data file that is 41
>> variables times who knows how many observations making up 27MB altogether.
>>
>> The sort of thing that I am thinking of having R do is
>>
>> - count the number of lines in a file
>
>You can do that without reading the file into memory: use
>system(paste("wc -l", filename))

Don't think that I can do that in Windows XP.

>or read in blocks of lines via a connection

But that does sound promising!

>> - form a data frame by selecting all cases whose line numbers are in a
>> supplied vector (which could be used to extract random subfiles of
>> particular sizes)
>
>R should handle that easily in today's memory sizes. Buy some more RAM if
>you don't already have 1/2Gb. As others have said, for a real large file,
>use a RDBMS to do the selection for you.

It's just that R is so good at reading in initial segments of a file that I can't believe that it can't be effective in reading more general (pre-specified) subsets.

Murray

--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                               Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
>>> I'm wondering if anyone has written some functions or code for handling
>>> very large files in R. I am working with a data file that is 41
>>> variables times who knows how many observations making up 27MB altogether.
>>>
>>> The sort of thing that I am thinking of having R do is
>>>
>>> - count the number of lines in a file
>>
>>You can do that without reading the file into memory: use
>>system(paste("wc -l", filename))
>
>Don't think that I can do that in Windows XP.

There are many ports of unix tools for windows; a recommended collection for R is kindly provided here:

http://www.stats.ox.ac.uk/pub/Rtools/tools.zip

This includes "wc".

Cheers,
Jim

James A. Rogers, Ph.D. <rogers at cantatapharm.com>
Statistical Scientist
Cantata Pharmaceuticals
300 Technology Square, 5th floor
Cambridge, MA 02139
617.225.9009 x312
Fax 617.225.9010
Murray Jorgensen <maj at stats.waikato.ac.nz> wrote:
	"Large" for my purposes means "more than I really want to read
	into memory" which in turn means "takes more than 30s". I'm at
	home now and the file isn't so I'm not sure if the file is large
	or not.

I repeat my earlier observation. The AMOUNT OF DATA is easily handled by a typical desktop machine these days. The problem is not the amount of data; the problem is HOW LONG IT TAKES TO READ. I made several attempts to read the test file I created yesterday, and each time gave up impatiently after 5+ minutes elapsed time. I tried again today (see below) and went away to have a cup of tea &c; it took nearly 10 minutes that time and still hadn't finished. 'mawk' read _and processed_ the same file happily in under 30 seconds.

One quite serious alternative would be to write a little C function to read the file into an array, and call that from R.

> system.time(m <- matrix(1:(41*250000), nrow=250000, ncol=41))
[1] 3.28 0.79 4.28 0.00 0.00
> system.time(save(m, file="m.bin"))
[1] 8.44 0.54 9.08 0.00 0.00
> m <- NULL
> system.time(load("m.bin"))
[1] 11.25 0.19 11.51 0.00 0.00
> length(m)
[1] 10250000

The binary file m.bin is 41 million bytes.

This little transcript shows that a data set of this size can be comfortably read from disc in under 12 seconds, on the same machine where scan() took about 50 times as long before I killed it.

So yet another alternative is to write a little program that converts the data file to R binary format, and then just read the whole thing in. I think readers will agree that 12 seconds on a 500MHz machine counts as "takes less than 30s".

	It's just that R is so good in reading in initial segments of a file
	that I can't believe that it can't be effective in reading more general
	(pre-specified) subsets.

R is *good* at it, it's just not *quick*. Trying to select a subset in scan() or read.table() wouldn't help all that much, because it would still have to *scan* the data to determine what to skip.

Two more times. An unoptimised C program writing 0:(41*250000-1) as a file of 41-number lines:

    f% time a.out >m.txt
    13.0u 1.0s 0:14 94% 0+0k 0+0io 0pf+0w

> system.time(m <- read.table("m.txt", header=FALSE))
^C
Timing stopped at: 552.01 15.48 584.51 0 0

To my eyes, src/main/scan.c shows no signs of having been tuned for speed. The goals appear to have been power (the R scan() function has LOTS of options) and correctness, which are perfectly good goals, and the speed of scan() and read.table() with modest data sizes is quite good enough.
> From: Richard A. O'Keefe [mailto:ok at cs.otago.ac.nz]
>
> [...]
>
> > system.time(save(m, file="m.bin"))
> [1] 8.44 0.54 9.08 0.00 0.00
> > m <- NULL
> > system.time(load("m.bin"))
> [1] 11.25 0.19 11.51 0.00 0.00
> > length(m)
> [1] 10250000

I tried the following on my IBM T22 Thinkpad (P3-933 w/ 512MB):

> system.time(x <- matrix(runif(41*250000), 250000, 41))
[1] 6.02 0.40 6.52 NA NA
> object.size(x)
[1] 82000120
> system.time(write(t(x), file="try.dat", ncol=41))
[1] 192.12 81.60 279.64 NA NA
> system.time(xx <- matrix(scan("try.dat"), byrow=TRUE, ncol=41))
Read 10250000 items
[1] 110.90 1.09 126.89 NA NA
> system.time(xx <- read.table("try.dat", header=FALSE,
+                              colClasses=rep("numeric", 41)))
[1] 106.61 0.48 110.66 NA NA
> system.time(save(x, file="try.rda"))
[1] 9.15 1.05 19.12 NA NA
> rm(x)
> system.time(load("try.rda"))
[1] 10.22 0.33 10.69 NA NA

The last few lines show that the timing I get is approximately the same as yours, so the other timings shouldn't be too different. I don't think I can make coffee that fast. (No, I don't drink it black!)

Andy

> [...]
> The huge ratio (>552)/(<30) for R/mawk does suggest that there may be
> room for some serious improvement in scan(), possibly by means of some
> extra hints about total size, possibly by creating a fast path through
> the code.
>
> Of course the big point is that however long scan() takes to read the
> data set, it only has to be done once. Leave R running overnight and in
> the morning save the dataset out as an R binary file using save(). Then
> you'll be able to load it again quickly.
As some of the conversation has treated the 30 second mark as an arbitrary benchmark, I would also chime in that there is an assumption that any non-R related issues that impact upon being able to usefully use R should be ignored. In the real world we can't always control everything about our environment. So if there are improvements that can be made that help mitigate the reality of the world, I would welcome them.

As a little test I broke the rules of my organisation and actually put a dataset on my C: drive. Not unexpectedly, the performance vastly improved. What would normally (at home) be a 10 second load becomes a 40 second load in a corporate environment.

I have found the conversation helpful and it would appear that there are opportunities for improvement that I would find helpful in my production environment. The other aside is that I have no UNIX-like tools, not because they don't exist, but because the environment I work in does not allow me to use them. This is not sufficient reason for me to bleat about it. It just is. By and large, I just get on with it.

My point is that while I accept that these issues are peripheral to R, they do impact upon the useability of R. I'm sure that there are people working with large databases in R. (The SPSS datasets that I regularly interact with vary between 97MB and 200MB.) It could be finger trouble on my part, but I find I have to subset them before I can read them into R. If I thought I could usefully convert these datasets into something that R could pick and choose from without reaching the out-of-memory problem, I would be very happy. In the meantime my lack of expertise has left me with a workable albeit clumsy process.

I will continue to champion R in my organisation, but the present score is SPSS-50, SAS-149, R-1. But all the really creative charts only come from one engine in this place.

> system.time(load("P:/.../0203Mapdata.rdata"))
[1]  9.79  0.97 37.45    NA    NA
> system.time(load("C:/TEMP/0203Mapdata.rdata"))
[1] 10.07  0.18 10.49    NA    NA
> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    1
minor    7.1
year     2003
month    06
day      16
language R

_________________________________________________
Tom Mulholland
Senior Policy Officer
WA Country Health Service
Tel: (08) 9222 4062

-----Original Message-----
From: Murray Jorgensen [mailto:maj at stats.waikato.ac.nz]
Sent: Monday, 25 August 2003 5:16 PM
To: Prof Brian Ripley
Cc: R-help
Subject: Re: [R] R tools for large files

> [...]
Hi Martin,

I don't know much about the concept of "connection" but I had supposed it to at least include the concept of "file" and perhaps also "input device" and "output device". I guess the important point that you are making is that it is sequential in the sense that you describe. I suppose at the time that I wrote my emails I didn't *know* that this was the case but rather assumed that this must be so, since it would be tedious in the extreme to have to work with the access functions if they kept going back to the beginning of the connection.

It may help to explain the application. The large files that I am working with are themselves statistical summaries of internet traffic flows (you will appreciate why they can be almost arbitrarily large!). I am interested in clustering these flows into different classes of traffic. I am using a model-based approach, so that the end-point will be statistical models for each cluster. Once these have been estimated they may be used in the classification of future traffic [including a residual class of traffic that does not fit any cluster well].

Based on experience with my clustering software (Multimix) I believe that it should work well on data sets of, say, 3000 observations. I plan to select a small number of random subsets of this size. The replication of these subsets should help me with model selection questions (How many clusters? How complex should each cluster model be?)

Tom Mulholland makes a good point when he notes that many R users (and other users) have very little control over their computing environment owing to somewhat arbitrary IT management decisions. For this reason it will be advantageous to have several solutions to large file problems.

I'm pleased that you think that efficient R functions for manipulating numbered lines from files may be written. I'm going to have a go at it just as soon as I finish a big item of paperwork!

BTW, I will be out of town and with much reduced email access over the next week or so, so if I don't reply to the list or individuals this should not be put down to laziness or rudeness!

Cheers,  Murray Jorgensen

PS Give my regards to Chris Hennig.

Martin Maechler wrote:
> Hi Murray,
>
> from reading your summarizing reply, I wonder if you missed the
> most important point about "connection"s (connection := generalization of file):
>
> Once you open() one, you can read it **sequentially**, e.g., in
> bunches of a "few" lines, i.e., you don't re-start from the
> beginning each time.
> I think this will allow to devise a pretty efficient R function
> for reading (and returning as a vector of strings) line numbers
> (n1, n2, ..., nm).
>
> Did you know this? If not, maybe you forward this answer (and
> your reaction to it) to R-help as well.
>
> Regards,
> Martin Maechler <maechler at stat.math.ethz.ch>  http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16  Leonhardstr. 27
> ETH (Federal Inst. Technology)  8092 Zurich  SWITZERLAND
> phone: x-41-1-632-3408  fax: ...-1228  <><

--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                               Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
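A rough sketch of the kind of function Martin describes, not taken from the thread: it reads the connection sequentially in blocks and keeps only the requested line numbers. The function name, the block size, and the assumption that the wanted line numbers fit comfortably in memory are all mine.

    ## Return the lines of `filename` whose line numbers are in `wanted`,
    ## reading the file once, block by block, through an open connection.
    readLinesAt <- function(filename, wanted, block = 10000) {
      wanted <- sort(wanted)                    # ascending line numbers
      con <- file(filename, open = "r")
      on.exit(close(con))
      out  <- character(length(wanted))
      seen <- 0                                 # lines read so far
      got  <- 0                                 # wanted lines found so far
      while (got < length(wanted)) {
        chunk <- readLines(con, n = block)
        if (length(chunk) == 0) break           # end of file
        idx <- wanted[wanted > seen & wanted <= seen + length(chunk)]
        if (length(idx) > 0) {
          out[got + seq_along(idx)] <- chunk[idx - seen]
          got <- got + length(idx)
        }
        seen <- seen + length(chunk)
      }
      out[seq_len(got)]
    }

The selected lines can then be parsed with, say, read.table(textConnection(readLinesAt("big.txt", sort(sample(230175, 3000))))).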
A starting point might be the string splitting function strsplit(). For example:

> X <- c("1,4,5", "1,2,5", "5,1,2")
> strsplit(X, ",")
[[1]]
[1] "1" "4" "5"

[[2]]
[1] "1" "2" "5"

[[3]]
[1] "5" "1" "2"

This returns a list of the parsed vectors. Next you can do something like:

> Z <- data.frame(matrix(as.numeric(unlist(strsplit(X, ","))), nrow = 3, byrow = TRUE))
> Z
  X1 X2 X3
1  1  4  5
2  1  2  5
3  5  1  2

-----Original Message-----
From: Ted.Harding at nessie.mcc.ac.uk [mailto:Ted.Harding at nessie.mcc.ac.uk]
Sent: 26 August 2003 09:00
To: R-help
Subject: Re: [R] R tools for large files

This has been an interesting thread! My first reaction to Murray's query was to think "use standard Unix tools, especially awk", 'awk' being a compact, fast, efficient program with great powers for processing lines of data files (and in particular extracting, subsetting and transforming database-like files, e.g. CSV-type). Of course, that became a sub-thread in its own right.

But -- and here I know I'm missing a trick, which is why I'm responding now so that someone who knows the trick can tell me -- while I normally use 'awk' "externally" (i.e. I filter a data file through an 'awk' program outside of R and then read the resulting file into R), I began to think about doing it from within R. Something on the lines of

    X <- system("cat raw_data | awk '...' ", intern=TRUE)

would create an object X which is a character vector, each element of which is one line from the output of the command "cat ...". E.g. if "raw_data" starts out as

    1,2,3,4,5
    1,3,4,2,5
    5,4,3,2,1
    5,3,4,1,2

then

    X <- system("cat raw_data.csv | awk 'BEGIN{FS=\",\"}{if($3>$2){print $1 \",\" $4 \",\" $5}}'", intern=TRUE)

gives

    > X
    [1] "1,4,5" "1,2,5" "5,1,2"

Now my Question: How do I convert X into the dataframe I would have got if I had read this output from a file instead of into the character vector X? In other words, how to convert a vector of character strings, each of which is in comma-separated format as above, into the rows of a data frame (or matrix, come to that)?

With thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 26-Aug-03   Time: 08:59:48
------------------------------ XFMail ------------------------------
On 26-Aug-03 Prof Brian Ripley wrote:
> On Tue, 26 Aug 2003 Ted.Harding at nessie.mcc.ac.uk wrote:
> [...]
>> > X
>> [1] "1,4,5" "1,2,5" "5,1,2"
>>
>> Now my Question:
>> [...]
>> In other words, how to convert a vector of character strings, each
>> of which is in comma-separated format as above, into the rows of
>> a data-frame (or matrix, come to that)?
>
> read.table() on a text connection.
>
>> X <- c("1,4,5", "1,2,5", "5,1,2")
>> read.table(textConnection(X), header=FALSE, sep=",")
>   V1 V2 V3
> 1  1  4  5
> 2  1  2  5
> 3  5  1  2

Thanks, Brian! Just the job.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 26-Aug-03   Time: 10:05:14
------------------------------ XFMail ------------------------------
Duncan Murdoch <dmurdoch at pair.com> wrote:
	For example, if you want to read lines 1000 through 1100, you'd do it
	like this:

	lines <- readLines("foo.txt", 1100)[1000:1100]

I created a dataset thus:

    # file foo.awk:
    BEGIN {
        s = "01"
        for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
        n = (27 * 1024 * 1024) / (length(s) + 1)
        for (i = 1; i <= n; i++) print s
        exit 0
    }
    # shell command:
    mawk -f foo.awk /dev/null >BIG

That is, each record contains 41 2-digit integers, and the number of records was chosen so that the total size was approximately 27 megabytes. The number of records turns out to be 230,175.

> system.time(v <- readLines("BIG"))
[1] 7.75 0.17 8.13 0.00 0.00
# With BIG already in the file system cache...
> system.time(v <- readLines("BIG", 200000)[199001:200000])
[1] 11.73 0.16 12.27 0.00 0.00

What's the importance of this? First, experiments I shall not weary you with showed that the time to read N lines grows faster than N. Second, if you want to select the _last_ thousand lines, you have to read _all_ of them into memory.

For real efficiency here, what's wanted is a variant of readLines where n is an index vector (a vector of non-negative integers, a vector of non-positive integers, or a vector of logicals) saying which lines should be kept. The function that would need changing is do_readLines() in src/main/connections.c; unfortunately I don't understand R internals well enough to do it myself (yet).

As a matter of fact, that _still_ wouldn't yield real efficiency, because every character would still have to be read by the modified readLines(), and it reads characters using Rconn_fgetc(), which is what gives readLines() its power and utility, but certainly doesn't give it wings. (One of the fundamental laws of efficient I/O library design is to base it on block- or line-at-a-time transfers, not character-at-a-time.)

The AWK program

    NR <= 199000 { next }
    {print}
    NR == 200000 { exit }

extracts lines 199001:200000 in just 0.76 seconds, about 15 times faster. A C program to the same effect, using fgets(), took 0.39 seconds, or about 30 times faster than R.

There are two fairly clear sources of overhead in the R code:

(1) the overhead of reading characters one at a time through Rconn_fgetc() instead of a block or line at a time. mawk doesn't use fgets() for reading, and _does_ have the overhead of repeatedly checking a regular expression to determine where the end of the line is, which it is sensible enough to fast-path.

(2) the overhead of allocating, filling in, and keeping, a whole lot of memory which is of no use whatever in computing the final result. mawk is actually fairly careful here, and only keeps one line at a time in the program shown above. Let's change it:

    NR <= 199000 {next}
    {a[NR] = $0}
    NR == 200000 {exit}
    END {for (i in a) print a[i]}

That takes the time from 0.76 seconds to 0.80 seconds.

The simplest thing that could possibly work would be to add a function skipLines(con, n) which simply read and discarded n lines.

	result <- scan(textConnection(lines), list( .... ))

> system.time(m <- scan(textConnection(v), integer(41)))
Read 41000 items
[1] 0.99 0.00 1.01 0.00 0.00

One whole second to read 41,000 numbers on a 500 MHz machine?

> vv <- rep(v, 240)

Is there any possibility of storing the data in (platform) binary form? Binary connections (R-data.pdf, section 6.5 "Binary connections") can be used to read binary-encoded data. I wrote a little C program to save out the 230175 records of 41 integers each in native binary form.
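skipLines() does not exist in R; purely to illustrate the interface being proposed, an R-level stand-in (necessarily much slower than a C implementation inside connections.c would be) could be sketched like this:

    ## Read and discard n lines from an already-open connection, in blocks.
    skipLines <- function(con, n, block = 10000) {
      while (n > 0) {
        k <- length(readLines(con, n = min(n, block)))
        if (k == 0) break                  # hit end of file early
        n <- n - k
      }
      invisible(n)                         # > 0 if the file ran out first
    }

    ## e.g., using the BIG file from the example above:
    ## con <- file("BIG", open = "r")
    ## skipLines(con, 199000)
    ## v <- readLines(con, 1000)
    ## close(con)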
Then in R I did

> system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
[1] 0.57 0.52 1.11 0.00 0.00
> system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
[1] 2.55 0.34 2.95 0.00 0.00

Remember, this doesn't read a *sample* of the data, it reads *all* the data. It is so much faster than the alternatives in R that it just isn't funny. Trying scan() on the file took nearly 10 minutes before I killed it the other day; using readBin() is a thousand times faster than a simple scan() call on this particular data set.

There has *got* to be a way of either generating or saving the data in binary form, using only "approved" Windows tools. Heck, it can probably be done using VBA.

By the way, I've read most of the .pdf files I could find on the CRAN site, but haven't noticed any description of the R save-file format. Where should I have looked? (Yes, I know about src/main/saveload.c; I was hoping for some documentation, with maybe some diagrams.)
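One way to get that binary form without leaving R or writing C — a sketch, not from the original post; the file names are illustrative and the slow scan() only has to be paid once:

    ## One-off conversion: read the text file the slow way, write platform binary.
    m <- scan("BIG", what = integer(0))            # slow, but done only once
    writeBin(as.integer(m), "BIG.bin", size = 4)

    ## Every later session: reload in seconds with readBin().
    n <- file.info("BIG.bin")$size / 4
    m <- matrix(readBin("BIG.bin", integer(), n = n, size = 4),
                ncol = 41, byrow = TRUE)

The same end is served by save()/load() on the parsed object, as noted earlier in the thread; readBin() has the advantage that the binary file can also be produced by any other program that writes native 4-byte integers.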
If we are going to use unix tools to create a new dataset before calling into R, why not simply use

    cat my_big_bad_file | tail +1001 | head -100

to read lines 1000-1100 (assuming one header row). Or if you have the shortlisted rownames in one file, you can use join after sort. A working example follows.

#################################################################################
#!/bin/bash
# match.sh last modified 10/07/03
# Does the same thing as egrep 'a|b|c|...' file but in batch mode
# A script that matches all occurrences of <shortlist> in <data> using the
# first column as common key

if [ $# -ne 2 ]; then
    echo "Usage: ${0/*\/} <shortlist> <data>"
    exit
fi

TEMP1=/tmp/temp1.`date "+%y%m%d-%H%M%S"`
TEMP2=/tmp/temp2.`date "+%y%m%d-%H%M%S"`
TEMP3=/tmp/temp3.`date "+%y%m%d-%H%M%S"`
TEMP4=/tmp/temp4.`date "+%y%m%d-%H%M%S"`
TEMP5=/tmp/temp5.`date "+%y%m%d-%H%M%S"`

grep -n . $1 | cut -f1 -d: | paste - $1 > $TEMP1
sort -k 2 $TEMP1 > $TEMP2
tail +2 $2 | sort -k 1 > $TEMP3
# Assume data file has header
headerRow=`head -1 $2`
join -j1 2 -j2 1 -a 1 -t\  $TEMP2 $TEMP3 > $TEMP4
sort -n -k 2 $TEMP4 > $TEMP5
/bin/echo "$headerRow"
cut -f1,3- $TEMP5    # column 2 contains orderings
rm $TEMP1 $TEMP2 $TEMP3 $TEMP4
#################################################################################

-----Original Message-----
From: Richard A. O'Keefe [mailto:ok at cs.otago.ac.nz]
Sent: Wednesday, August 27, 2003 9:04 AM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] R tools for large files

[...]
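The same tail/head idea can be driven from inside R on a Unix-like system, since read.table() accepts a pipe() connection. A sketch only: the file name is invented and the exact header/off-by-one bookkeeping is left to the reader.

    ## Parse roughly lines 1001-1100 of the file without reading the rest into R.
    sub <- read.table(pipe("tail +1001 my_big_bad_file | head -100"),
                      header = FALSE)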
Duncan Murdoch <dmurdoch at pair.com> wrote:
	One complication with reading a block at a time is what to do when you
	read too far.

It's called "buffering".

	Not all connections can use seek() to reposition to the beginning, so
	you'd need to read them one character at a time, (or attach a buffer
	somehow, but then what about rw connections?)

You don't need seek() to do buffered block-at-a-time reading. For example, you can't lseek() on a UNIX terminal, but UNIX C stdio *does* read a block at a time from a terminal.

I don't see what the problem with read-write connections is supposed to be. When you want to read from such a connection, you first force out any buffered output, and then you read a buffer's worth (if available) of input. Of course the read buffer and the write buffer are separate. (C stdio has traditionally got this wrong, with the perverse consequence that you have to fseek() when switching from reading to writing or vice versa, but that doesn't mean it can't be got right.)

To put all this in context though, remember that S was designed in a UNIX environment to work in a UNIX environment, and it was always intended to exploit UNIX tools. Even on a Windows box, if you get R, you get a bunch of the usual UNIX tools with it. Amongst other things, Perl is freely available for Windows; a Perl program to read a couple of hundred thousand records and spit them out in platform binary would only be a few lines long, and R _is_ pretty good at reading binary data.

It really is important that R users should be allowed to use it the way that the language was designed to be used.