Marie Pierre Sylvestre
2007-Dec-19 02:24 UTC
[R] assigning and saving datasets in a loop, with names changing with "i"
Dear R users,

I am analysing a very large data set and I need to perform several data manipulations. The dataset is so big that the only way I can play with it without having memory problems (e.g. "cannot allocate vectors of size...") is to write a batch script to:

1. cut the data into pieces
2. save the pieces in separate .RData files
3. remove everything from the environment
4. load one of the pieces
5. perform the manipulations on it
6. save it and remove it from the environment
7. redo 4-6 for every piece
8. merge everything together at the end

It works if coded line by line, but since I'll have to perform these tasks on other data sets, I am trying to automate this as much as I can.

I am using a loop in which I used 'assign' and 'get' (pseudo code below). My problem is that when I use 'get', it prints the whole object on the screen. I am wondering whether there is a more efficient way to do what I need to do. Any help would be appreciated. Please keep in mind that the whole process is quite computer-intensive, so I can't keep everything in the environment while R performs calculations.

Say I have 1 big dataframe called data. I use 'split' to divide it into a list of 12 dataframes (call this list my.list).

my.fun is a function that takes a dataframe, performs several manipulations on it and returns a dataframe.

    for (i in 1:12){
      assign(paste("data", i, sep = ""), my.fun(my.list[i]))  # this works
      # now I need to save this new object as a RData.

      # The following line does not work
      save(paste("data", i, sep = ""), file = paste(paste("data", i, sep = ""), "RData", sep = "."))

      # This works but it is a bit convoluted!!!
      temp <- get(paste("data", i, sep = ""))
      save(temp, file = "lala.RData")
    }

I am *sure* there is something more clever to do but I can't find it. Any help would be appreciated.

best regards,

MP
Benilton Carvalho
2007-Dec-19 02:33 UTC
[R] assigning and saving datasets in a loop, with names changing with "i"
You want to use:

    save(list = paste("data", i, sep = ""), file = paste("data", i, ".Rdata", sep = ""))

b

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
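A minimal sketch of the save(list = ...) idiom in a full loop, for readers who want to try it. The toy data frame `big`, the stand-in `my.fun`, and the use of tempdir() are assumptions made for illustration only; they are not from the original post. save() treats whatever is passed through `...` as literal object names, so a paste() call placed there is not evaluated, whereas `list=` accepts a character vector of names.

```r
# Toy stand-ins (hypothetical): a small data frame split into 12 pieces
big <- data.frame(id = 1:24, g = rep(1:12, each = 2), x = as.numeric(1:24))
my.list <- split(big, big$g)
my.fun <- function(d) transform(d, x2 = x^2)   # placeholder manipulation

out.dir <- tempdir()
for (i in 1:12) {
  nm <- paste("data", i, sep = "")
  assign(nm, my.fun(my.list[[i]]))             # creates data1, data2, ...
  # `list=` takes the object name as a character string
  save(list = nm, file = file.path(out.dir, paste(nm, ".RData", sep = "")))
  rm(list = nm)                                # free the piece before the next one
}

# load() restores the object under its saved name and returns that name
restored <- load(file.path(out.dir, "data3.RData"))
```

After the last line, `restored` holds the name of the restored object and `data3` is back in the workspace, while the other pieces stay on disk.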
Moshe Olshansky
2007-Dec-20 06:07 UTC
[R] assigning and saving datasets in a loop, with names changing with "i"
Won't it be simpler to do:

    for (i in 1:12) {
      # [[i]] passes the data frame itself; [i] would pass a one-element list
      data <- my.fun(my.list[[i]])
      save(data, file = paste("data", i, ".RData", sep = ""))
    }
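With this approach every file holds an object called `data`, so loading a file into the workspace overwrites whatever `data` is already there. One way to read the pieces back and merge them (step 8 of the original post) is to load each file into its own environment. The sketch below assumes three toy pieces written to a temporary directory; `out.dir` and `read.piece` are hypothetical names introduced for the example.

```r
out.dir <- file.path(tempdir(), "pieces")
dir.create(out.dir, showWarnings = FALSE)

# Write three toy pieces, each saved under the same object name `data`
for (i in 1:3) {
  data <- data.frame(piece = i, value = (1:2) * i)
  save(data, file = file.path(out.dir, paste("data", i, ".RData", sep = "")))
}
rm(data)

# Load one file into a private environment and hand back its `data` object
read.piece <- function(i) {
  e <- new.env()
  load(file.path(out.dir, paste("data", i, ".RData", sep = "")), envir = e)
  e$data
}

# Stack the pieces row-wise into a single data frame
merged <- do.call(rbind, lapply(1:3, read.piece))
```

Note that the merged result still has to fit in memory, so this only helps when the processed pieces are smaller than the raw data.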
Henrik Bengtsson
2007-Dec-20 09:26 UTC
[R] assigning and saving datasets in a loop, with names changing with "i"
    library(R.utils)

    # Process and save one piece at a time
    for (ii in 1:12) {
      # [[ii]] passes the data frame itself; [ii] would pass a one-element list
      value <- my.fun(my.list[[ii]])
      saveObject(value, file = sprintf("data%02d.RData", ii))
      rm(value)
      gc()
    }

    # Read the pieces back, one at a time
    for (ii in 1:12) {
      value <- loadObject(sprintf("data%02d.RData", ii))
    }
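For readers without R.utils, a similar pattern can be sketched with base R's saveRDS() and readRDS(), which are available in more recent versions of R than the one current at the time of this thread. Like loadObject() above, readRDS() returns the object, so the caller chooses the variable name. The built-in mtcars data set and the ".rds" file names are assumptions used only to make the example self-contained.

```r
out.dir <- tempdir()
pieces <- split(mtcars, mtcars$cyl)   # three toy pieces from a built-in data set

for (ii in seq_along(pieces)) {
  value <- pieces[[ii]]
  # one object per file; no object name is stored alongside it
  saveRDS(value, file = file.path(out.dir, sprintf("data%02d.rds", ii)))
  rm(value)
}

# readRDS() returns the object, so it can be assigned to any name
second <- readRDS(file.path(out.dir, sprintf("data%02d.rds", 2)))
```

Zero-padding the index with "%02d" keeps the files in processing order when the directory is listed alphabetically.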
Greg Snow
2007-Dec-20 20:57 UTC
[R] assigning and saving datasets in a loop, with names changing with "i"
Have you looked at the SQLiteDF package? It seems like it would do what you want in a better and much simpler way. Even if that does not work, a database approach (look at the other db packages, probably RODBC first) could be simpler, faster, and easier.

You may also want to look at the g.data package for another approach.

Depending on what you are doing, you may also want to look at the biglm package (some similar functionality is in SQLiteDF).

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111
Tony Plate
2007-Dec-22 00:55 UTC
[R] assigning and saving datasets in a loop, with names changing with "i"
The trackObjs package is designed to make it easy to work in approximately the manner you describe (steps 1-8 of your post) -- it saves objects automatically to disk but they are still accessible as normal. Here's how you could do those steps. This example works with 10 8Mb objects in an R session with a limit of 40Mb.

    # allow R only 40Mb of vector memory
    mem.limits(vsize=40e6)
    mem.limits()/1e6
    library(trackObjs)
    # start tracking to store data objects in the directory 'data'
    # each object is 8Mb, and we store 10 of them
    track.start("data")
    n <- 10
    m <- 1e6
    constructObject <- function(i) i+rnorm(m)
    # steps 1, 2 & 3:
    for (i in 1:n) {
      xname <- paste("x", i, sep="")
      cat("", xname)
      assign(xname, constructObject(i))
      # store in a file, accessible by name:
      track(list=xname)
    }
    cat("\n")
    gc(TRUE)
    # accessing object by name
    object.size(x1)/2^20 # In Mb
    mean(x1)
    mean(x2)
    gc(TRUE)
    # steps 4:6
    # accessing object through a constructed name
    result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
    result
    # remove the data objects
    track.remove(list=paste("x", 1:n, sep=""))
    track.stop()

Here's a full transcript of the above. Note how, whenever gc() is called, there is hardly any vector memory in use.

    > # allow R only 40Mb of vector memory
    > mem.limits(vsize=40e6)
       nsize    vsize
          NA 40000000
    > mem.limits()/1e6
    nsize vsize
       NA    40
    > library(trackObjs)
    > # start tracking to store data objects in the directory 'data'
    > # each object is 8Mb, and we store 10 of them
    > track.start("data")
    > n <- 10
    > m <- 1e6
    > constructObject <- function(i) i+rnorm(m)
    > # steps 1, 2 & 3:
    > for (i in 1:n) {
    +   xname <- paste("x", i, sep="")
    +   cat("", xname)
    +   assign(xname, constructObject(i))
    +   # store in a file, accessible by name:
    +   track(list=xname)
    + }
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
    > cat("\n")

    > gc(TRUE)
    Garbage collection 19 = 6+0+13 (level 2) ...
    4.0 Mbytes of cons cells used (42%)
    0.7 Mbytes of vectors used (5%)
             used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
    Ncells 148362  4.0     350000  9.4         NA   350000  9.4
    Vcells  89973  0.7    1950935 14.9       38.2  2112735 16.2
    > # accessing object by name
    > object.size(x1)/2^20 # In Mb
    [1] 7.629417
    > mean(x1)
    [1] 0.998635
    > mean(x2)
    [1] 1.999656
    > gc(TRUE)
    Garbage collection 22 = 7+1+14 (level 2) ...
    4.0 Mbytes of cons cells used (43%)
    0.7 Mbytes of vectors used (6%)
             used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
    Ncells 149264  4.0     350000  9.4         NA   350000  9.4
    Vcells  90160  0.7    1560747 12.0       38.2  2112735 16.2
    > # steps 4:6
    > result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
    > result
     [1]  0.998635  1.999656  2.997368  4.000197  5.000159  6.001216  6.999552
     [8]  7.999743  8.999982 10.001355
    > # remove the data objects
    > track.remove(list=paste("x", 1:n, sep=""))
     [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10"
    > track.stop()