James Savage
2013-May-20 08:34 UTC
[R] table() generating NAs when there are no NAs in the underlying data
Hi all,

Just a quick question: I want to generate a column of counts of a particular variable. The easiest way seems to be using table(). For reasonably small amounts of data, there seems to be no problem.

    C <- data.frame(A1 = sample(1:1000, 100000, replace = TRUE),
                    B1 = sample(1:1000, 100000, replace = TRUE))
    C$countC <- table(C$A1)[C$A1]
    summary(C$countC)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         65      94     101     101     108     132

However, if I'm building a table from a larger set (note that now I'm sampling from 1:10k, rather than 1:1k), it generates NAs, despite there being no NAs in the data I'm building the table from:

    C <- data.frame(A1 = sample(1:10000, 100000, replace = TRUE),
                    B1 = sample(1:10000, 100000, replace = TRUE))
    C$countC <- table(C$A1)[C$A1]
    summary(C$A1)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          1    2512    5005    5008    7502   10000

    summary(C$countC)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       1.00    8.00   10.00   10.18   12.00   25.00       7

Note that if you cannot replicate this on your computer, try increasing the size of the set to sample from (setting it at 1000000 did the trick for a colleague of mine).

The problem appears not to occur if the data are not in a data frame.

    A <- sample(1:10000, 1000000, replace = TRUE)
    summary(table(as.factor(A))[A])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         57      94     101     101     108     144

It seems to have thrown NAs only for the last few categories (in the case of sampling from 100k, only 99998, 99999 and 100000). That makes it manageable, but definitely not ideal.

I also posted this question to Stack Overflow, and users there have contributed a work-around. However, I would like to know why table() is exhibiting this behaviour.

Cheers,
Jim
Duncan Murdoch
2013-May-20 10:07 UTC
[R] table() generating NAs when there are no NAs in the underlying data
On 13-05-20 4:34 AM, James Savage wrote:
> Hi all,
>
> Just a quick question:
>
> I want to generate a column of counts of a particular variable. The easiest way seems to be using table(). For reasonably small amounts of data, there seems to be no problem.
>
> C <- data.frame(A1 = sample(1:1000, 100000, replace = TRUE), B1 = sample(1:1000, 100000, replace = TRUE))
> C$countC <- table(C$A1)[C$A1]
> summary(C$countC)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>      65      94     101     101     108     132
>
> However, if I'm building a table from a larger set (note that now I'm sampling from 1:10k, rather than 1:1k), it generates NAs, despite there being no NAs in the data I'm building the table from:
>
> C <- data.frame(A1 = sample(1:10000, 100000, replace = TRUE), B1 = sample(1:10000, 100000, replace = TRUE))
> C$countC <- table(C$A1)[C$A1]
> summary(C$A1)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>       1    2512    5005    5008    7502   10000
>
> summary(C$countC)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>    1.00    8.00   10.00   10.18   12.00   25.00       7
>
> Note that if you cannot replicate this on your computer, try increasing the size of the set to sample from (setting it at 1000000 did the trick for a colleague of mine).
>
> The problem appears not to occur if the data are not in a data-frame.
>
> A <- sample(1:10000, 1000000, replace = TRUE)
> summary(table(as.factor(A))[A])
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>      57      94     101     101     108     144
>
> It seems to only have thrown NAs only for the last few categories (in the case sampling from 100k, only 99998, 99999 and 100000). That makes it manageable, but definitely not ideal.
>
> I also posted this question to Stack Overflow, and users there have contributed a work-around. However, I would like to know why table() is exhibiting this behaviour.

table() isn't generating the NAs. You're indexing the table by values that don't exist in it.
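
To make the mechanism concrete, here is a minimal sketch with a small hypothetical vector x (not data from this thread). Indexing a table with numbers selects cells by position, not by label, so a value larger than the number of cells comes back as NA, and a smaller one can silently pick up the wrong cell:

```r
# Hypothetical data: only the values 2 and 5 occur, so the table has two cells.
x <- c(2, 2, 5, 5, 5)
tab <- table(x)
names(tab)   # the cell labels are "2" and "5"

# Numeric index = position. Position 2 is the cell labelled "5" (count 3),
# and position 5 does not exist, so it yields NA.
by_position <- as.vector(tab[x])

# Character index = label. Each value is looked up under its own name.
by_name <- as.vector(tab[as.character(x)])

print(by_position)
print(by_name)
```

The positional result is wrong for the rows where x is 2 and NA for the rows where x is 5; the lookup by label returns the actual counts for every row.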

You don't give a completely reproducible example (use set.seed() if you want us to see the same random numbers you had), but from the look of it, your C$A1 column contains 9993 unique values. You compute a table and get a one-dimensional array with 9993 entries. Then you index it by C$A1, which contains values larger than 9993, and get some NAs.

Duncan Murdoch

For example, try this:
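
A reproducible sketch along those lines, under some assumptions: the seed value is arbitrary, and the lookup via as.character() is just one possible way around the problem, not necessarily the work-around suggested on Stack Overflow and not the example from the original message. Whether any value in 1:10000 is actually missing depends on the random draw, so the comparison on the table length may come out either way:

```r
set.seed(1)  # arbitrary seed (an assumption), only so that reruns draw the same numbers
C <- data.frame(A1 = sample(1:10000, 100000, replace = TRUE))
tab <- table(C$A1)

# If this is TRUE, at least one value in 1:max(C$A1) was never drawn,
# so cell positions no longer line up with the values themselves.
gap_present <- length(tab) < max(C$A1)

# Positional lookup, as in the original post: may contain NAs or shifted counts.
by_position <- as.vector(tab[C$A1])

# Lookup by label: every value in C$A1 is a cell name of its own table.
by_name <- as.vector(tab[as.character(C$A1)])

print(gap_present)
print(sum(is.na(by_position)))
print(anyNA(by_name))
```

Because every value in C$A1 appears as a label in table(C$A1), the lookup by label cannot produce NAs regardless of which numbers were drawn.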