Hi all,

I am a beginner trying to use R to work with large amounts of oceanographic data, and I find that computations can be VERY slow. In particular, computational speed seems to depend strongly on the number and size of the objects that are loaded when R starts up. The same computations are significantly faster when all but the essential objects are removed. I am running R on a machine with 16 GB of RAM, and our unix system manager assures me that there is memory available to my R process that has not been used.

1. Is the problem associated with how R uses memory? If so, is there some way to increase the amount of memory used by my R process to get better performance?

The computations that are particularly slow involve looping with by(). The data are measurements of vertical profiles of pressure, temperature, and salinity at a number of stations, organized into a data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). There are 1409 unique values of id, and the objective is to get a much smaller data frame with the minimum and maximum pressure for each profile. The slow part is:

h.maxmin <- by(p.1, p.1$id, function(x) {
    data.frame(id   = x$id[1],
               maxp = max(x$p),
               minp = min(x$p))
})

2. Even with unneeded data objects removed, this is very slow. Is there a faster way to get the maximum and minimum values?

platform   sparc-sun-solaris2.9
arch       sparc
os         solaris2.9
system     sparc, solaris2.9
status
major      1
minor      7.0
year       2003
month      04
day        16
language   R

Thank you for your time.

Helen
Douglas Bates
2003-Jul-01 14:31 UTC
[R] Computations slow in spite of large amounts of RAM.
"Huiqin Yang" <Huiqin.Yang at noaa.gov> writes:> Hi all, > > I am a beginner trying to use R to work with large amounts of > oceanographic data, and I find that computations can be VERY slow. In > particular, computational speed seems to depend strongly on the number > and size of the objects that are loaded (when R starts up). The same > computations are significantly faster when all but the essential > objects are removed. I am running R on a machine with 16 GB of RAM, > and our unix system manager assures me that there is memory available > to my R process that has not been used. > > 1. Is the problem associated with how R uses memory? If so, is there > some way to increase the amount of memory used by my R process to get > better performance?You could try setting a large nsize and vsize using mem.limits See the description in ?Memory> The computations that are particularly slow involve looping with > by(). The data are measurements of vertical profiles of pressure, > temperature, and salinity at a number of stations, which are organized > into a dataframe p.1 (1925930 rows, 8 columns: id, p, t, and s, etc.), > and the objective is to get a much smaller dataframe and the unique > values for ID is 1409 with the minimum and maximum pressure for each > profile. The slow part is: > > h.maxmin <- by(p.1,p.1$id,function(x){ > data.frame(id=x$id[1], > maxp=max(x$p), > minp=min(x$p))})I think it would be faster to use h.maxmin <- tapply(p.1$p, p.1$id, range) In the call to by you are subsetting the entire data frame and that probably means taking at least one copy of that frame. If you use tapply on only the relevant columns you will use much less space.> 2. Even with unneeded data objects removed, this is very slow. Is > there a faster way to get the maximum and minimum values?See above. -- Douglas Bates bates at stat.wisc.edu Statistics Department 608/262-2598 University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/
> From: Huiqin Yang [mailto:Huiqin.Yang at noaa.gov]
>
> Hi all,
>
> I am a beginner trying to use R to work with large amounts of oceanographic data, and I find that computations can be VERY slow. In particular, computational speed seems to depend strongly on the number and size of the objects that are loaded when R starts up. The same computations are significantly faster when all but the essential objects are removed. I am running R on a machine with 16 GB of RAM, and our unix system manager assures me that there is memory available to my R process that has not been used.
>
> 1. Is the problem associated with how R uses memory? If so, is there some way to increase the amount of memory used by my R process to get better performance?

Is R compiled as 64-bit? If not, it won't be able to use more than 4GB of RAM (that's my understanding, anyway). R keeps objects in memory, so if you are working with large amounts of data, it's a good habit to keep only the absolutely essential objects in the workspace, and save() and rm() things you don't need for the computation.

> The computations that are particularly slow involve looping with by(). The data are measurements of vertical profiles of pressure, temperature, and salinity at a number of stations, organized into a data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). There are 1409 unique values of id, and the objective is to get a much smaller data frame with the minimum and maximum pressure for each profile. The slow part is:
>
> h.maxmin <- by(p.1, p.1$id, function(x) {
>     data.frame(id   = x$id[1],
>                maxp = max(x$p),
>                minp = min(x$p))
> })
>
> 2. Even with unneeded data objects removed, this is very slow. Is there a faster way to get the maximum and minimum values?

Why do you need to use by(), and why have the function return a data frame containing only one row? Here's an experiment on my 900MHz PIII laptop:

> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace=TRUE)),
+                   p = rnorm(n))
> system.time(h.maxmin <- by(dat, dat$id, function(x) {
+     data.frame(id=x$id[1], maxp=max(x$p), minp=min(x$p))}))
[1] 2.75 0.01 2.78 NA NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13 NA NA

If you want to coerce the result to a data frame with id as row names and min and max as the two variables, you can do:

junk.dat <- as.data.frame(do.call("rbind", junk))

HTH,
Andy

> platform   sparc-sun-solaris2.9
> arch       sparc
> os         solaris2.9
> system     sparc, solaris2.9
> status
> major      1
> minor      7.0
> year       2003
> month      04
> day        16
> language   R
>
> Thank you for your time.
>
> Helen
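A small follow-on sketch building on the as.data.frame(do.call("rbind", junk)) line above. The explicit column names and the added id column are assumptions chosen to match the maxp/minp layout of the original by() version, not part of the reply itself:

## Continuing from the experiment: junk is the list returned by tapply()
junk.dat <- as.data.frame(do.call("rbind", junk))
names(junk.dat) <- c("minp", "maxp")   # range() returns c(min, max)
junk.dat$id <- rownames(junk.dat)      # the ids were carried as row names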