Ott Toomet
2003-May-30 08:37 UTC
[R] Coefficients: (20 not defined because of singularities)
Hi,

"singularity" in this case means that your X'X matrix is singular, i.e. you have exact (or numerically exact) multicollinearity in your data. A common reason is selecting observations with a particular binary feature (e.g. only women) and then including a control variable for the same feature (e.g. a child*women cross effect). You seem to be working with continuous variables, so this may not be the case.

One way to check for collinearity is to use condition numbers (see kappa() in R). First, build the model matrix (you may use model.matrix(), but if you have only plain variables and no special effects, cbind() will do). Then take a single column of the matrix and calculate its condition number (this is definitely 1). Now add the second column and calculate again, and so on for the remaining columns. Print the condition numbers against the number of columns you used. You should see where the number explodes; that means the corresponding variable is collinear with some of the previous ones.

Perhaps this helps.

Ott

| From: Thomas Fischer <th.fischer at gmx.net>
| Date: Fri, 30 May 2003 10:44:55 +0200
|
| Hello,
|
| I am trying to run a linear regression analysis on my data set. For some
| reason most variables are removed due to singularities.
|
| My linear regression looks this way (I am using only partial data, which
| is selected by flags):
|
| fm<-lm(log(cplex6.time..sec..[flags]) ~ cplex6.cities[flags] +
|     log(1/features.meanOver.frust[flags]) +
|     log(1/features.meanOver.minDist[flags]) +
| The summary of one of the removed coefficients looks like this:
|
| > summary(features.spanOver.quart1SpanDist[flags])
|     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
|  0.05584 0.05797 0.06366 0.06311 0.06674 0.07290
| > summary(log(1/features.spanOver.quart1SpanDist[flags]))
|     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
|    2.619   2.707   2.754   2.767   2.848   2.885
|
| The summary of a coefficient that was kept looks this way:
|
| > summary(features.quant25Over.minDist[flags])
|      Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
|  0.001030 0.001030 0.001030 0.001032 0.001030 0.001040
| > summary(log(1/features.quant25Over.minDist[flags]))
|     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
|    6.869   6.878   6.878   6.877   6.878   6.878
|
| So, I don't see the difference. Why has the first coefficient been
| removed and the second one kept?
| Please help me.
|
| I'm using R 1.6.2 on a Linux x86 machine.
|
| Greetings,
| Thomas Fischer
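The column-by-column check Ott describes could be sketched roughly as follows. This is only an illustration on made-up data, not the poster's data set: the variables x1 to x4 and the helper cond_by_column() are hypothetical, and x3 is deliberately built as an exact linear combination so that there is something to find. Instead of calling kappa(), the sketch computes the 2-norm condition number by hand from the singular values returned by svd() (largest divided by smallest), which makes explicit what is being measured.

```r
# Hypothetical example data (assumption: not taken from the original post).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + 2 * x2   # exactly collinear with x1 and x2 by construction
x4 <- rnorm(n)

# Model matrix with an intercept column, built with cbind() as Ott suggests.
X <- cbind(Intercept = 1, x1 = x1, x2 = x2, x3 = x3, x4 = x4)

# Hypothetical helper: condition number of the first k columns, for every k.
# The 2-norm condition number is the ratio of the largest to the smallest
# singular value; it is Inf if the smallest singular value is exactly zero.
cond_by_column <- function(X) {
  cond <- sapply(seq_len(ncol(X)), function(k) {
    d <- svd(X[, seq_len(k), drop = FALSE], nu = 0, nv = 0)$d
    max(d) / min(d)
  })
  names(cond) <- colnames(X)
  cond
}

cond <- cond_by_column(X)
print(signif(cond, 3))
```

In this sketch the condition number stays small up to x2 and jumps by many orders of magnitude at x3, which flags x3 as the column that depends on the earlier ones. The order of the columns matters: the jump appears at the last column of a dependent set, not necessarily at the variable one would consider the culprit.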
Thomas Fischer
2003-May-30 08:44 UTC
[R] Coefficients: (20 not defined because of singularities)
Hello,

I am trying to run a linear regression analysis on my data set. For some reason most variables are removed due to singularities.

My linear regression looks this way (I am using only partial data, which is selected by flags):

fm<-lm(log(cplex6.time..sec..[flags]) ~ cplex6.cities[flags] +
    log(1/features.meanOver.frust[flags]) +
    log(1/features.meanOver.minDist[flags]) +
    [...]
    avg..steps.to.loc..Opt..norm..[flags] + NN.List.opt..tour.max.[flags])

As I am using inversion and logarithms, I set all data to positive values before running lm():

cplex6.time..sec..[cplex6.time..sec..<=0.00001]=0.00001
features.meanOver.frust[features.meanOver.frust<=0.00001]=0.00001
features.meanOver.minDist[features.meanOver.minDist<=0.00001]=0.00001
[...]
features.varOver.varDist[features.varOver.varDist<=0.00001]=0.00001

Retrieving the summary of fm, I get the message that some coefficients have been removed.

[...]
Coefficients: (20 not defined because of singularities)
                                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      87.2162    44.1148   1.977   0.0511 .
log(1/features.meanOver.frust[flags])            -2.5298     0.1515 -16.702   <2e-16 ***
log(1/features.meanOver.minDist[flags])         154.7170    11.3917  13.582   <2e-16 ***
log(1/features.meanOver.quant25Dist[flags])    -943.4625    71.3505 -13.223   <2e-16 ***
log(1/features.meanOver.quart1SpanDist[flags])  776.1049    60.0571  12.923   <2e-16 ***
log(1/features.meanOver.spanDist[flags])         -9.8069     0.1400 -70.038   <2e-16 ***
log(1/features.meanOver.varDist[flags])         -11.3211     0.6715 -16.859   <2e-16 ***
log(1/features.quant25Over.minDist[flags])      -46.9655     3.1438 -14.939   <2e-16 ***
avg..steps.to.loc..Opt..norm..[flags]             0.8324     1.0919   0.762   0.4478
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
[...]

The summary of one of the removed coefficients looks like this:

> summary(features.spanOver.quart1SpanDist[flags])
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.05584 0.05797 0.06366 0.06311 0.06674 0.07290
> summary(log(1/features.spanOver.quart1SpanDist[flags]))
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.619   2.707   2.754   2.767   2.848   2.885

The summary of a coefficient that was kept looks this way:

> summary(features.quant25Over.minDist[flags])
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.001030 0.001030 0.001030 0.001032 0.001030 0.001040
> summary(log(1/features.quant25Over.minDist[flags]))
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.869   6.878   6.878   6.877   6.878   6.878

So, I don't see the difference. Why has the first coefficient been removed and the second one kept?
Please help me.

I'm using R 1.6.2 on a Linux x86 machine.

Greetings,
Thomas Fischer
Prof Brian Ripley
2003-May-30 09:06 UTC
[R] Coefficients: (20 not defined because of singularities)
It is the model matrix which is singular, *not* the variable. You are trying to fit a collinear model. Use alias() to see what is going on.

On Fri, 30 May 2003, Thomas Fischer wrote:

> Hello,
>
> I am trying to run a linear regression analysis on my data set. For some
> reason most variables are removed due to singularities.
>
> My linear regression looks this way (I am using only partial data, which
> is selected by flags):
>
> fm<-lm(log(cplex6.time..sec..[flags]) ~ cplex6.cities[flags] +
>     log(1/features.meanOver.frust[flags]) +
>     log(1/features.meanOver.minDist[flags]) +
>     [...]
>     avg..steps.to.loc..Opt..norm..[flags] + NN.List.opt..tour.max.[flags])
>
> As I am using inversion and logarithms I set all data to positiv values,
> before running lm():
>
> cplex6.time..sec..[cplex6.time..sec..<=0.00001]=0.00001
> features.meanOver.frust[features.meanOver.frust<=0.00001]=0.00001
> features.meanOver.minDist[features.meanOver.minDist<=0.00001]=0.00001
> [...]
> features.varOver.varDist[features.varOver.varDist<=0.00001]=0.00001
>
> Retrieving the summary of fm, I get the message, that some coefficients
> have been removed.

No, that they are not defined, as it says.

> [...]
> Coefficients: (20 not defined because of singularities)
>                                                 Estimate Std. Error t value Pr(>|t|)
> (Intercept)                                      87.2162    44.1148   1.977   0.0511 .
> log(1/features.meanOver.frust[flags])            -2.5298     0.1515 -16.702   <2e-16 ***
> log(1/features.meanOver.minDist[flags])         154.7170    11.3917  13.582   <2e-16 ***
> log(1/features.meanOver.quant25Dist[flags])    -943.4625    71.3505 -13.223   <2e-16 ***
> log(1/features.meanOver.quart1SpanDist[flags])  776.1049    60.0571  12.923   <2e-16 ***
> log(1/features.meanOver.spanDist[flags])         -9.8069     0.1400 -70.038   <2e-16 ***
> log(1/features.meanOver.varDist[flags])         -11.3211     0.6715 -16.859   <2e-16 ***
> log(1/features.quant25Over.minDist[flags])      -46.9655     3.1438 -14.939   <2e-16 ***
> avg..steps.to.loc..Opt..norm..[flags]             0.8324     1.0919   0.762   0.4478
> ---
> Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> [...]
>
> The summary of one of the removed coefficients looks like this:

That's the summary of the variable, not the coefficient.

> > summary(features.spanOver.quart1SpanDist[flags])
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.05584 0.05797 0.06366 0.06311 0.06674 0.07290
> > summary(log(1/features.spanOver.quart1SpanDist[flags]))
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    2.619   2.707   2.754   2.767   2.848   2.885
>
> The summary of a coefficient that was kept looks this way:
>
> > summary(features.quant25Over.minDist[flags])
>      Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
>  0.001030 0.001030 0.001030 0.001032 0.001030 0.001040
> > summary(log(1/features.quant25Over.minDist[flags]))
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    6.869   6.878   6.878   6.877   6.878   6.878
>
> So, I don't see the difference. Why has the first coefficient been
> removed and the second one kept?
> Please help me.
>
> I'm using R 1.6.2 on a Linux x86 machine.
>
> Greetings,
> Thomas Fischer

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
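The alias() suggestion above might be tried along these lines. This is a minimal sketch on made-up data, assuming a small data frame d with hypothetical columns y, x1, x2 and x3, where x3 is constructed as an exact linear combination of x1 and x2; it is not the poster's model. It shows the same symptom (a coefficient reported as NA, i.e. not defined) and then asks alias() which columns of the model matrix are linear combinations of the others.

```r
# Hypothetical example data (assumption: not taken from the original post).
set.seed(1)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$x3 <- d$x1 + 2 * d$x2               # exact linear combination by construction
d$y  <- 1 + d$x1 - d$x2 + rnorm(n)

# Fit the collinear model; lm() cannot estimate a separate effect for x3.
fm <- lm(y ~ x1 + x2 + x3, data = d)
print(coef(fm))                       # the x3 coefficient is NA ("not defined")

# alias() reports the dependencies in the model matrix. The "Complete"
# component has one row per aliased term, expressed in terms of the
# columns that were kept.
a <- alias(fm)
print(a$Complete)
```

Each row of the Complete component names a term whose coefficient is not defined, and the entries in that row indicate which retained columns it is a combination of. That is a statement about the model matrix as a whole, which is why looking at summary() of a single variable cannot explain why its coefficient is missing.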