R is so smart! I found that when you switch a column from integer to factor, the memory consumption goes down rather impressively.

Now I'd like to learn more. How does R do this? What does R do? How do I learn more?

I got to thinking: If I were really smart, I'd see that a factor with 2 levels requires only 1 bit of storage. So I'd be able to cram 8 such factors into a byte. But this would come at the price of complexity of code, since reading and writing that object would require sub-byte operations. Does R go this far? I think not, given the more modest gains that I see. Does it go down to a byte? A four-byte word instead of 8 bytes of storage?

What are Ncells and Vcells, and what determines R's consumption of memory for each kind?

If you're curious about this, here's a program that serves as a demo:

  x <- matrix(as.numeric(runif(1e6)>.5), nrow=100000)
  D <- data.frame(x)
  rm(x)

  # Take stock:
  gc()
  sum(gc()[,2])
  object.size(D)

  # Switch to factors --
  D$X1 <- factor(D$X1); D$X2 <- factor(D$X2); D$X3 <- factor(D$X3)
  D$X4 <- factor(D$X4); D$X5 <- factor(D$X5); D$X6 <- factor(D$X6)
  D$X7 <- factor(D$X7); D$X8 <- factor(D$X8); D$X9 <- factor(D$X9)
  D$X10 <- factor(D$X10)

  # Take stock:
  gc()
  sum(gc()[,2])
  object.size(D)

Using this, I find that the cost of these 10 vectors goes down from 12 Meg to 8 Meg. This suggests savings, but not the dramatic impact of recognising that a factor with 2 levels only requires 1 bit.

-- 
Ajay Shah
Consultant, Department of Economic Affairs
Ministry of Finance, New Delhi
ajayshah at mayin.org
http://www.mayin.org/ajayshah
Ajay Narottam Shah wrote:
> R is so smart! I found that when you switch a column from integer to
> factor, the memory consumption goes down rather impressively.
>
> Now I'd like to learn more. How does R do this? What does R do?

Most numeric variables are stored as 8 byte doubles. Factors are stored as 4 byte integers, plus a table giving the factor levels.

> How do
> I learn more?

You will sometimes find what you want in the R Language Definition, for example here:

"Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers. Rather unfortunately users often make use of the implementation in order to make some calculations easier. This, however, is an implementation issue and is not guaranteed to hold in all implementations of R."

For more details, there are some implementation documents on developer.r-project.org, but in general the only sure way to find out how something is implemented is to look at the source code.

Usually it's a bad idea to rely on the implementation details, as the last sentence quoted above says. If it's not documented, it's subject to change without warning.

> I got to thinking: If I was really smart, I'd see that a factor with 2
> levels requires only 1 bit of storage. So I'd be able to cram 8 such
> factors into a byte. But this would come at the price of complexity of
> code since reading and writing that object would require sub-byte
> operations. Does R go this far? I think not, given the more modest
> gains that I see. Does it go down to a byte? A four-byte word
> instead of 8 bytes of storage?
>
> What are Ncells and Vcells, and what determines R's consumption of
> memory for each kind?

See the man pages ?gc, ?Memory, and the source code.

Duncan Murdoch
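
The storage difference described in the reply can be checked directly. The following is only a minimal sketch, not part of the original messages: the variable names (x_double, x_integer, x_factor, sizes) are made up for illustration, and the exact byte counts depend on the R version and platform, so none are quoted here. It compares the same 0/1 data held as a double vector, an integer vector, and a factor.

```r
# Illustrative sketch: variable names are hypothetical, not from the thread.
n <- 1e5
set.seed(1)

x_double  <- as.numeric(runif(n) > 0.5)  # 0/1 values stored as doubles
x_integer <- as.integer(x_double)        # same values stored as integers
x_factor  <- factor(x_double)            # integer codes plus a levels attribute

# A factor is an integer vector underneath, with its levels kept separately.
typeof(x_double)
typeof(x_factor)
levels(x_factor)

# Size in bytes of each representation.
sizes <- sapply(
  list(double = x_double, integer = x_integer, factor = x_factor),
  function(v) as.numeric(object.size(v))
)
sizes

# Approximate bytes per element: about 8 for doubles, about 4 for the others.
round(sizes / n, 2)
```

If this behaves as expected, the factor should come out only slightly larger than the plain integer vector (the extra is the levels table and attributes) and at about half the size of the double vector. That matches the halving of the data itself in the demo above, and it shows that no sub-byte packing takes place.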