Timothy Clough
2009-Sep-20 18:06 UTC
[R] missing level of a nested factor results in an NA in lm output
Hello All, I have posted to this list before regarding the same issue so I apologize for the multiple e-mails. I am still struggling with this issue so I thought I'd give it another try. This time I have included reproducible code and a subset of the data I am analyzing. I am running an ANOVA with three factors: GROUP (5 levels), FEATURE (2 levels), and PATIENT (2 levels), where PATIENT is nested within GROUP. I am interested in estimating various linear functions of the model coefficients (which I sometimes refer to as 'contrasts' below). An example of the data can be set up using the following code: example <- data.frame( ABUNDANCE = rnorm(30, 12), FEATURE = factor(rep(c(3218, 4227, 6374), 10)), GROUP = factor(rep(c(0, 1, 2, 3, 4), 6)), PATIENT = factor(rep(c(1, 2), 15)) ) I am using the lm function to run the model as shown. fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/PATIENT, example) summary(fit) The output of this code is below. > fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, example) > summary(fit) Call: lm(formula = ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, data = example) Residuals: Min 1Q Median 3Q Max -5.510e-01 -1.961e-01 3.469e-17 1.961e-01 5.510e-01 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12.6487 0.4037 31.331 2.58e-11 *** FEATURE4227 -0.3271 0.4944 -0.661 0.52325 FEATURE6374 -1.0743 0.4944 -2.173 0.05492 . GROUP1 -1.2641 0.5709 -2.214 0.05120 . GROUP2 -0.2359 0.5709 -0.413 0.68825 GROUP3 -1.3081 0.5709 -2.291 0.04492 * GROUP4 -0.5867 0.5709 -1.028 0.32836 FEATURE4227:GROUP1 1.5651 0.6993 2.238 0.04915 * FEATURE6374:GROUP1 1.1435 0.6993 1.635 0.13302 FEATURE4227:GROUP2 0.4372 0.6993 0.625 0.54577 FEATURE6374:GROUP2 0.6728 0.6993 0.962 0.35867 FEATURE4227:GROUP3 1.0135 0.6993 1.449 0.17786 FEATURE6374:GROUP3 2.2665 0.6993 3.241 0.00885 ** FEATURE4227:GROUP4 1.3278 0.6993 1.899 0.08679 . FEATURE6374:GROUP4 0.5610 0.6993 0.802 0.44103 GROUP0:PATIENT2 -0.5569 0.4037 -1.379 0.19785 GROUP1:PATIENT2 -0.1104 0.4037 -0.273 0.79014 GROUP2:PATIENT2 -0.9702 0.4037 -2.403 0.03712 * GROUP3:PATIENT2 -0.1400 0.4037 -0.347 0.73586 GROUP4:PATIENT2 -0.5947 0.4037 -1.473 0.17147 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.4944 on 10 degrees of freedom Multiple R-squared: 0.8004, Adjusted R-squared: 0.4211 F-statistic: 2.11 on 19 and 10 DF, p-value: 0.1133 I then use the estimable function to estimate a linear combination of the parameter estimates. myEstimate <- cbind( '(Intercept)' = 1, 'GROUP1' = 1, 'FEATURE4227:GROUP1' = 0.5, 'FEATURE6374:GROUP1' = 0.5, 'GROUP0:PATIENT2' = 1 ) rownames(myEstimate) <- "test" > estimable(fit, myEstimate) Estimate Std. Error t value DF Pr(>|t|) test 12.18198 0.6694812 18.19615 10 5.395944e-09 I am able to get the t-statistic and associated p-value for the contrast as desired. However, I sometimes have a case where there is a missing patient within one of the groups. To give an example of this situation, I remove patient 2 from group 4 and perform the same analysis. example2 <- example[!(example$GROUP == 4 & example$PATIENT == 2),] > fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, example2) > summary(fit) Call: lm(formula = ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, data = example2) Residuals: Min 1Q Median 3Q Max -5.510e-01 -2.084e-01 6.099e-20 2.084e-01 5.510e-01 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 12.6487 0.4442 28.475 2.5e-09 *** FEATURE4227 -0.3271 0.5440 -0.601 0.5644 FEATURE6374 -1.0743 0.5440 -1.975 0.0837 . GROUP1 -1.2641 0.6282 -2.012 0.0790 . GROUP2 -0.2359 0.6282 -0.375 0.7171 GROUP3 -1.3081 0.6282 -2.082 0.0709 . GROUP4 -0.4570 0.7023 -0.651 0.5335 FEATURE4227:GROUP1 1.5651 0.7694 2.034 0.0764 . FEATURE6374:GROUP1 1.1435 0.7694 1.486 0.1755 FEATURE4227:GROUP2 0.4372 0.7694 0.568 0.5854 FEATURE6374:GROUP2 0.6728 0.7694 0.874 0.4074 FEATURE4227:GROUP3 1.0135 0.7694 1.317 0.2242 FEATURE6374:GROUP3 2.2665 0.7694 2.946 0.0185 * FEATURE4227:GROUP4 1.2146 0.9423 1.289 0.2334 FEATURE6374:GROUP4 0.2850 0.9423 0.302 0.7700 GROUP0:PATIENT2 -0.5569 0.4442 -1.254 0.2454 GROUP1:PATIENT2 -0.1104 0.4442 -0.248 0.8101 GROUP2:PATIENT2 -0.9702 0.4442 -2.184 0.0605 . GROUP3:PATIENT2 -0.1400 0.4442 -0.315 0.7606 GROUP4:PATIENT2 NA NA NA NA --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.544 on 8 degrees of freedom Multiple R-squared: 0.7852, Adjusted R-squared: 0.3019 F-statistic: 1.625 on 18 and 8 DF, p-value: 0.2462 As you can see there are now NAs in the row corresponding to the removed patient. I still want to estimate my contrast so I run the same code as before, but I receive the following error messages: myEstimate <- cbind( '(Intercept)' = 1, 'GROUP1' = 1, 'FEATURE4227:GROUP1' = 0.5, 'FEATURE6374:GROUP1' = 0.5, 'GROUP0:PATIENT2' = 1 ) rownames(myEstimate) <- "test" > estimable(fit, myEstimate) Error in estimable.default(fit, myEstimate) : Dimension of structure(c(1, 0, 0, 1, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0, : 1x20, not compatible with no of parameters in fit: 19 Dimension of 1, 0, 0, 0, 0), .Dim = c(1L, 20L), .Dimnames = list("test", c("(Intercept)", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE4227", "FEATURE6374", "GROUP1", "GROUP2", "GROUP3", "GROUP4", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE4227:GROUP1", "FEATURE6374:GROUP1", "FEATURE4227:GROUP2", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE6374:GROUP2", "FEATURE4227:GROUP3", "FEATURE6374:GROUP3", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE4227:GROUP4", "FEATURE6374:GROUP4", "GROUP0:PATIENT2", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "GROUP1:PATIENT2", "GROUP2:PATIENT2", "GROUP3:PATIENT2", "GROUP4:PATIENT2": 1x20, not compatible with no of parameters in fit: 19 Dimension of ))): 1x2 Apparently the NA in the parameter estimate table is preventing the estimation of the contrast. I have tried multiple alternatives in the coding of the estimable statement, but all give me errors similar to the one above. I think the problem is that R expects to see each patient within each group and when it doesn't see patient 2 in group 4, it yields an NA parameter estimate for this combination. That's fine - the results of the other parameter estimates are still correct. However it's preventing me from estimating a linear function of the model coefficients, and I know that there must be a way to do this even in the presence of the missing combination. Has anyone run into this problem before? Any input would be appreciated. Sorry for the lengthy message. Tim
Timothy Clough
2009-Sep-21 18:19 UTC
[R] missing level of a nested factor results in an NA in lm output
Dear All, I have posted to this list before regarding the same issue so I apologize for the multiple e-mails. I am still struggling with this issue so I thought I'd give it another try. This time I have included reproducible code and a subset of the data I am analyzing. I am running an ANOVA with three factors: GROUP (5 levels), FEATURE (2 levels), and PATIENT (2 levels), where PATIENT is nested within GROUP. I am interested in estimating various linear functions of the model coefficients (which I sometimes refer to as 'contrasts' below). An example of the data can be set up using the following code: example <- data.frame( ABUNDANCE = rnorm(30, 12), FEATURE = factor(rep(c(3218, 4227, 6374), 10)), GROUP = factor(rep(c(0, 1, 2, 3, 4), 6)), PATIENT = factor(rep(c(1, 2), 15)) ) I am using the lm function to run the model as shown. fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/PATIENT, example) summary(fit) The output of this code is below. > fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, example) > summary(fit) Call: lm(formula = ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, data = example) Residuals: Min 1Q Median 3Q Max -5.510e-01 -1.961e-01 3.469e-17 1.961e-01 5.510e-01 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12.6487 0.4037 31.331 2.58e-11 *** FEATURE4227 -0.3271 0.4944 -0.661 0.52325 FEATURE6374 -1.0743 0.4944 -2.173 0.05492 . GROUP1 -1.2641 0.5709 -2.214 0.05120 . GROUP2 -0.2359 0.5709 -0.413 0.68825 GROUP3 -1.3081 0.5709 -2.291 0.04492 * GROUP4 -0.5867 0.5709 -1.028 0.32836 FEATURE4227:GROUP1 1.5651 0.6993 2.238 0.04915 * FEATURE6374:GROUP1 1.1435 0.6993 1.635 0.13302 FEATURE4227:GROUP2 0.4372 0.6993 0.625 0.54577 FEATURE6374:GROUP2 0.6728 0.6993 0.962 0.35867 FEATURE4227:GROUP3 1.0135 0.6993 1.449 0.17786 FEATURE6374:GROUP3 2.2665 0.6993 3.241 0.00885 ** FEATURE4227:GROUP4 1.3278 0.6993 1.899 0.08679 . FEATURE6374:GROUP4 0.5610 0.6993 0.802 0.44103 GROUP0:PATIENT2 -0.5569 0.4037 -1.379 0.19785 GROUP1:PATIENT2 -0.1104 0.4037 -0.273 0.79014 GROUP2:PATIENT2 -0.9702 0.4037 -2.403 0.03712 * GROUP3:PATIENT2 -0.1400 0.4037 -0.347 0.73586 GROUP4:PATIENT2 -0.5947 0.4037 -1.473 0.17147 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.4944 on 10 degrees of freedom Multiple R-squared: 0.8004, Adjusted R-squared: 0.4211 F-statistic: 2.11 on 19 and 10 DF, p-value: 0.1133 I then use the estimable function to estimate a linear combination of the parameter estimates. myEstimate <- cbind( '(Intercept)' = 1, 'GROUP1' = 1, 'FEATURE4227:GROUP1' = 0.5, 'FEATURE6374:GROUP1' = 0.5, 'GROUP0:PATIENT2' = 1 ) rownames(myEstimate) <- "test" > estimable(fit, myEstimate) Estimate Std. Error t value DF Pr(>|t|) test 12.18198 0.6694812 18.19615 10 5.395944e-09 I am able to get the t-statistic and associated p-value for the contrast as desired. However, I sometimes have a case where there is a missing patient within one of the groups. To give an example of this situation, I remove patient 2 from group 4 and perform the same analysis. example2 <- example[!(example$GROUP == 4 & example$PATIENT == 2),] > fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, example2) > summary(fit) Call: lm(formula = ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ PATIENT, data = example2) Residuals: Min 1Q Median 3Q Max -5.510e-01 -2.084e-01 6.099e-20 2.084e-01 5.510e-01 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 12.6487 0.4442 28.475 2.5e-09 *** FEATURE4227 -0.3271 0.5440 -0.601 0.5644 FEATURE6374 -1.0743 0.5440 -1.975 0.0837 . GROUP1 -1.2641 0.6282 -2.012 0.0790 . GROUP2 -0.2359 0.6282 -0.375 0.7171 GROUP3 -1.3081 0.6282 -2.082 0.0709 . GROUP4 -0.4570 0.7023 -0.651 0.5335 FEATURE4227:GROUP1 1.5651 0.7694 2.034 0.0764 . FEATURE6374:GROUP1 1.1435 0.7694 1.486 0.1755 FEATURE4227:GROUP2 0.4372 0.7694 0.568 0.5854 FEATURE6374:GROUP2 0.6728 0.7694 0.874 0.4074 FEATURE4227:GROUP3 1.0135 0.7694 1.317 0.2242 FEATURE6374:GROUP3 2.2665 0.7694 2.946 0.0185 * FEATURE4227:GROUP4 1.2146 0.9423 1.289 0.2334 FEATURE6374:GROUP4 0.2850 0.9423 0.302 0.7700 GROUP0:PATIENT2 -0.5569 0.4442 -1.254 0.2454 GROUP1:PATIENT2 -0.1104 0.4442 -0.248 0.8101 GROUP2:PATIENT2 -0.9702 0.4442 -2.184 0.0605 . GROUP3:PATIENT2 -0.1400 0.4442 -0.315 0.7606 GROUP4:PATIENT2 NA NA NA NA --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.544 on 8 degrees of freedom Multiple R-squared: 0.7852, Adjusted R-squared: 0.3019 F-statistic: 1.625 on 18 and 8 DF, p-value: 0.2462 As you can see there are now NAs in the row corresponding to the removed patient. I still want to estimate my contrast so I run the same code as before, but I receive the following error messages: myEstimate <- cbind( '(Intercept)' = 1, 'GROUP1' = 1, 'FEATURE4227:GROUP1' = 0.5, 'FEATURE6374:GROUP1' = 0.5, 'GROUP0:PATIENT2' = 1 ) rownames(myEstimate) <- "test" > estimable(fit, myEstimate) Error in estimable.default(fit, myEstimate) : Dimension of structure(c(1, 0, 0, 1, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0, : 1x20, not compatible with no of parameters in fit: 19 Dimension of 1, 0, 0, 0, 0), .Dim = c(1L, 20L), .Dimnames = list("test", c("(Intercept)", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE4227", "FEATURE6374", "GROUP1", "GROUP2", "GROUP3", "GROUP4", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE4227:GROUP1", "FEATURE6374:GROUP1", "FEATURE4227:GROUP2", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE6374:GROUP2", "FEATURE4227:GROUP3", "FEATURE6374:GROUP3", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "FEATURE4227:GROUP4", "FEATURE6374:GROUP4", "GROUP0:PATIENT2", : 1x20, not compatible with no of parameters in fit: 19 Dimension of "GROUP1:PATIENT2", "GROUP2:PATIENT2", "GROUP3:PATIENT2", "GROUP4:PATIENT2": 1x20, not compatible with no of parameters in fit: 19 Dimension of ))): 1x2 Apparently the NA in the parameter estimate table is preventing the estimation of the contrast. I have tried multiple alternatives in the coding of the estimable statement, but all give me errors similar to the one above. I think the problem is that R expects to see each patient within each group and when it doesn't see patient 2 in group 4, it yields an NA parameter estimate for this combination. That's fine - the results of the other parameter estimates are still correct. However it's preventing me from estimating a linear function of the model coefficients, and I know that there must be a way to do this even in the presence of the missing combination. Has anyone run into this problem before? Tim -- Timothy Clough Ph.D. Student Department of Statistics Purdue University 765-496-7263
Erik Iverson
2009-Sep-21 18:31 UTC
[R] missing level of a nested factor results in an NA in lm output
> > estimable(fit, myEstimate) > Estimate Std. Error t value DF Pr(>|t|) > test 12.18198 0.6694812 18.19615 10 5.395944e-09Where are you getting this "estimable" function from? A package? Did you define it yourself?