jens.lund@nordea.com
2001-Oct-10 15:43 UTC
[Rd] Wrong df in bartlett.test when subsetting data (PR#1124)
Full_Name: Jens Lund Version: 1.3.1 OS: Windows Submission from: (NULL) (194.239.194.210) When we subset a dataset bartlett.test does not calculate the df correctly: X <- rnorm(50) # Generate simulated samples f <- gl(10,5) # Generate a factor with 10 levels and # 5 replications of each level # Works correctly bartlett.test(X,f) # Correct number of df = 9 # Just use the first 5 levels of f # Does not work correctly bartlett.test(X[1:25],f[1:25]) # Should have df = 4, but uses df = 9 bartlett.test(X~f,subset=1:25) # Should have df = 4, but uses df = 9 In version 1.2.2 the problem could be solved by substituting k <- nlevels(g) with k <- length(unique(g)) (I do not get the R-code in the new version get by typing bartlett.test at the prompt.) This might leed to a discussion on whether nlevels should return the number of attained factor levels or the number of "theoretical" levels: nlevels(f) # Returns 10, fine! nlevels(f[1:25]) # Returns 10, might not be fine; 5 might be the correct answer. The same question might be asked for levels, levels(f) versus levels(f[1:25]). At least the documentation should be clear on this point, which I do not think it is now. (Perhaps the levels attribute on the factor should be changed for a subset of the data, but this is probably not an efficient way of doing things...) A similar problem also comes up with interactions between factors in unbalanced designs: g <- gl(2,40,50) nlevels(f:g) # Returns 20 length(unique(f:g)) # Returns 10 table(f,g) # See the design This bug has also been seen on version 1.2.2 on Linux, but is reproduced for this bug report on:> version_ platform i386-pc-mingw32 arch x86 os Win32 system x86, Win32 status major 1 minor 3.1 year 2001 month 08 day 31 language R Best regards, Jens -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Thomas Lumley
2001-Oct-10 15:52 UTC
[Rd] Wrong df in bartlett.test when subsetting data (PR#1124)
On Wed, 10 Oct 2001 jens.lund@nordea.com wrote: <stuff about actual bug in bartlett.test() snipped> > This might leed to a discussion on whether nlevels should return the number of > attained factor levels or the number of "theoretical" levels: > nlevels(f) # Returns 10, fine! > nlevels(f[1:25]) # Returns 10, might not be fine; 5 might be the correct > answer. > The same question might be asked for levels, levels(f) versus levels(f[1:25]). > At least the documentation should be clear on this point, which I do not think > it is now. (Perhaps the levels attribute on the factor should be changed for a > subset of the data, but this is probably not an efficient way of doing > things...) > > A similar problem also comes up with interactions between factors in unbalanced > designs: > g <- gl(2,40,50) > nlevels(f:g) # Returns 20 > length(unique(f:g)) # Returns 10 > table(f,g) # See the designI strongly believe that levels() and nlevels() are doing the right thing here. If a variable is a factor with two levels (eg Male & Female) then it shouldn't stop being a factor with two levels just because you have a small sample. In particular, when you have a single observation g[i] it is still important to know the set of levels it comes from, rather than just the category it is in. -thomas Thomas Lumley Asst. Professor, Biostatistics tlumley@u.washington.edu University of Washington, Seattle -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._