Recorded here so others may avoid my mistakes.

I have a bunch of files containing fixed-width data. The R Data guide suggests pre-processing such files with a script if they are large. Mine were 50 MB and up, and I needed to process another file that gave the layout of the lines anyway.

I tried rpy, not only to preprocess but to create the R data object in one go. It seemed like a good idea; it wasn't. The core operation was to build up a string for each line that looked like "data.frame(var1=val1, var2=val2, [etc])" and then rbind this to the data.frame built so far. I did this with r(mycommand string). Almost all the values were numeric.

This was incredibly slow: it had not completed after running overnight.

So, the lesson is, don't do that!

I switched to preprocessing that created a csv file, and then used read.csv from R. This worked in under a minute. The result had dimension 150913 x 129.

The good news in rpy was that I found objects persisted across calls to the r object.

Exactly why this was so slow I don't know. The two obvious suspects are the speed of rbind, which I think is pretty inefficient, and the overhead of crossing the python/R boundary.

This was on Debian Lenny:
python-rpy 1.0.3-2
Python 2.5.2
R 2.7.1

rpy2 is not available in Lenny, though it is in development versions of Debian.

Ross Boylan
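For anyone wanting to try the csv route described above, here is a minimal sketch of the preprocessing step in Python. It is not the script from the post: the LAYOUT list, its column names and offsets, and the function name fixed_width_to_csv are all made up for illustration, and a real layout would be parsed from the separate layout file.

```python
import csv
import io

# Hypothetical layout: (column name, zero-based start offset, width).
# In the post the layout came from a separate file; this list stands in for it.
LAYOUT = [("id", 0, 4), ("age", 4, 3), ("score", 7, 6)]


def fixed_width_to_csv(lines, layout, out):
    """Write one CSV row per fixed-width input line; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    # Header row so that read.csv can pick up the column names.
    writer.writerow([name for name, _, _ in layout])
    count = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Slice each field out of the line and drop the padding spaces.
        writer.writerow(
            [line[start:start + width].strip() for _, start, width in layout]
        )
        count += 1
    return count


# Small in-memory demonstration; real use would pass open file objects.
sample = ["0001 42  3.50\n", "0002  7 12.25\n"]
buffer = io.StringIO()
rows_written = fixed_width_to_csv(sample, LAYOUT, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting file can then be loaded with a single read.csv call in R, so the whole data frame is built once instead of growing by one rbind per input line.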
Gabor Grothendieck
2009-Aug-16 22:36 UTC
[R] good and bad ways to import fixed column data (rpy)
Check out ?read.fwf

On Sun, Aug 16, 2009 at 4:49 PM, Ross Boylan <ross at biostat.ucsf.edu> wrote:
> Recorded here so others may avoid my mistakes.
> [...]
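read.fwf takes a vector of field widths, where a negative width means "skip this many columns". As a rough illustration of that convention only (this is a Python sketch, not how read.fwf is implemented, and the function names widths_to_slices and split_fixed are made up here), a widths vector can be turned into column slices like this:

```python
def widths_to_slices(widths):
    """Turn a list of field widths into slices; negative widths are skipped."""
    slices = []
    pos = 0
    for w in widths:
        if w > 0:
            slices.append(slice(pos, pos + w))
        # Advance past the field whether it is kept or skipped.
        pos += abs(w)
    return slices


def split_fixed(line, widths):
    """Split one fixed-width line into stripped field strings."""
    return [line[s].strip() for s in widths_to_slices(widths)]


# Four-character id, two skipped columns, then a 3-wide and a 6-wide field.
fields = split_fixed("0001XX 42  3.50", [4, -2, 3, 6])
print(fields)
```

In R itself the same widths vector would be passed directly as the widths argument of read.fwf.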