Several recent questions and answers have made me look at some code, and I realized that some functions may not be great to use when you are dealing with very large amounts of data that may already be getting close to the limits of your memory. Does the function you call to do one thing to your object perhaps overdo it and make multiple copies and not delete them as soon as they are not needed?

An example was a recent post suggesting a nice set of tools you can use to convert your data.frame so the columns are integers or dates no matter how they were read in from a CSV file or created.

What I noticed is that copies of a sort were often made by trying to change the original to, say, one date format or another and then deciding which, if any, to keep. Sometimes multiple transformations are tried, and this may be done repeatedly with intermediates left lying around. Yes, the memory will all be implicitly returned when the function completes. But often these functions invoke yet other functions which work on their own copies. You can end up with your original data temporarily using several times as much actual memory.

R does have features so some things are "shared" unless one copy or another changes. But in the cases I am looking at, changes are the whole idea.

What I wonder is whether such functions should explicitly call rm() or the equivalent as soon as possible when something is no longer needed.

The various kinds of pipelines are another case in point, as they involve all kinds of hidden temporary variables that eventually need to be cleaned up. When are they removed? I have seen pipelines with 10 or more steps, as perhaps data is read in, has rows or columns removed or re-ordered, has grouping applied, is merged with other data, and reports are generated. The intermediates are often of similar size to the data and, if large, can add up. If writing the code linearly using temp1 and temp2 type variables to hold the output of one stage and the input of the next stage, I would be tempted to add an rm(temp1) as soon as it was finished being used, or just reuse the same name temp1 so the previous contents are no longer being pointed to and can be taken by the garbage collector at some time.

So I wonder if some functions should have a note in their manual pages specifying what may happen to the volume of data as they run. An example would be if I had a function that took a matrix and simply squared it using matrix multiplication. There are various ways to do this, and one of them simply makes a copy and invokes the built-in way in R that multiplies two matrices. It then returns the result. So you end up storing basically three times the size of the matrix right before you return it. Other methods might do the actual multiplication in loops operating on subsections of the matrix and, if done carefully, never keep more than, say, 2.1 times as much data around.

Or is this not important often enough? All I know is that data may be getting larger much faster than the memory in our machines gets larger.
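To make the temp1/temp2 idea concrete, here is a minimal sketch in base R. The data frame, its column names, and the three stages are invented purely for illustration; they are not taken from the post that prompted the question. It shows both options mentioned above: calling rm() on an intermediate once it has been used, and reusing a single name so the previous value becomes unreachable.

```r
# Hypothetical three-stage pipeline written with explicit intermediates.
# The data and the stage logic are made up for illustration only.
set.seed(1)
n <- 1e5
raw <- data.frame(id = seq_len(n),
                  when = rep("2021-11-27", n),
                  value = rnorm(n),
                  stringsAsFactors = FALSE)

# Stage 1: convert the character column to Date. The result is a modified
# object, so it cannot simply share storage with 'raw' for that column.
temp1 <- transform(raw, when = as.Date(when))

# Stage 2: drop rows. temp1 is not needed after this line.
temp2 <- temp1[temp1$value > 0, ]
rm(temp1)   # remove the binding so the garbage collector may reclaim it

# Stage 3: aggregate. temp2 is not needed after this line.
result <- aggregate(value ~ when, data = temp2, FUN = mean)
rm(temp2)

# Alternative: reuse one name, so each earlier value loses its only
# reference as soon as the next stage is assigned.
step <- raw
step <- transform(step, when = as.Date(when))
step <- step[step$value > 0, ]
result2 <- aggregate(value ~ when, data = step, FUN = mean)

invisible(gc())   # optional: request a collection now rather than later
print(format(object.size(raw), units = "MB"))
```

Note that rm() only removes the name; the memory is reclaimed whenever the garbage collector next runs, and only if nothing else still refers to the object.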
First priority is to obtain a correct answer. Second priority is to document it and write tests for it. Third priority is to optimize it. Sometimes it is useful to keep intermediate values around to support supplemental calculations à la "summary", which may or may not lead to calling rm() where you might think it should be called. But often the optimization step is simply neglected.

On November 27, 2021 9:56:50 AM PST, Avi Gross via R-help <r-help at r-project.org> wrote:
>[...]
>What I wonder is whether such functions should clearly call an rm() or the
>equivalent as soon as possible when something is no longer needed.
>[...]
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
If you have enough data that running out of memory is a serious problem, then a language like R or Python or Octave or Matlab that offers you NO control over storage may not be the best choice. You might need to consider Julia or even Rust.

However, if you have enough data that running out of memory is a serious problem, your problems may be worse than you think. In 2021, Linux is *still* having OOM Killer problems.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
Your process hogging memory may cause some other process to be killed. Even if that doesn't happen, your process may simply be thrown off the machine without being warned.

It may be one of the biggest problems around in statistical computing: how to make it straightforward to carve up a problem so that it can be run on many machines. R has the 'Rmpi' and 'snow' packages, amongst others.
https://CRAN.R-project.org/view=HighPerformanceComputing

Another approach is to select and transform data outside R. If you have data in some kind of database, then doing the select and transform in the database may be a good approach.

On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help at r-project.org> wrote:
> [...]
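As a single-machine illustration of carving a problem up so that the whole data set never has to sit in memory at once, here is a minimal sketch in base R that computes a column mean by reading a CSV file in chunks. The file, the column names, and the chunk size are all assumptions made up for the example; it is not a substitute for the packages or database approach mentioned in the thread.

```r
# Write a small hypothetical CSV so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
set.seed(42)
full <- data.frame(id = 1:1000, value = rnorm(1000))
write.csv(full, path, row.names = FALSE)

chunk_size <- 250   # rows held in memory at once (an arbitrary choice)

con <- file(path, open = "r")
# Read the header line once; write.csv quotes names, so strip the quotes.
header <- strsplit(readLines(con, n = 1), ",")[[1]]
header <- gsub('"', "", header)

total <- 0   # running sum of the 'value' column
count <- 0   # running number of rows seen
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk$value)
  count <- count + nrow(chunk)
  # 'chunk' is overwritten on the next pass, so only one chunk is live.
}
close(con)
unlink(path)

chunk_mean <- total / count
```

The same accumulate-per-chunk pattern only works directly for summaries that can be combined across chunks (sums, counts, minima, maxima); other statistics need a different decomposition.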