Min-Han Tan
2004-Aug-21 14:44 UTC
[R] A troubled state of freedom: generalized linear models where number of parameters > number of samples
Good morning,

Thank you all for your help so far. I really appreciate it.

The crux of my problem is that I am fitting a generalized linear model with 1 dependent variable, approximately 50 training samples and 100 parameters (gene levels). Essentially, if I have 100 genes and 50 samples, the fit returns coefficients for the first 49 genes and NAs for the rest, with an extremely low residual deviance (usually approx. 10^-27).

This seems to be related to the number of degrees of freedom, since as the number of genes increases up to 49, the number of residual degrees of freedom drops to 0.

What kind of methods can I use to make sense of this? I have a subsequent set of samples with which to validate the results of this GLM, so I am not sure whether overfitting is really a problem.

Background: this is a microarray study, where I have divided the samples in the training set into 2 groups and selected a number of genes to differentiate between the two groups. I am going to use the GLM in a subsequent regression analysis to determine survival. For this purpose, I need to generate some kind of score for each individual case by summing, over the genes, each gene's coefficient * that gene's expression level.

I am a clinician, not a statistician - many apologies if I am not expressing myself very clearly here!

Thanks.

Min-Han Tan
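
The behaviour described above (coefficients for an intercept plus 49 genes, NAs for the rest, zero residual degrees of freedom) is what a rank-deficient design matrix produces. The following is a minimal sketch in plain Python, not the actual R fitting code: it uses simulated Gaussian "expression" values rather than real microarray data, and a hypothetical helper `pivot_columns` that finds the columns linearly independent of those to their left, which only loosely mirrors how a pivoted least-squares fit decides which coefficients are estimable.

```python
import random


def pivot_columns(matrix, tol=1e-9):
    """Return indices of columns that are linearly independent of the
    columns to their left, found by Gaussian elimination with partial pivoting."""
    rows = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(rows), len(rows[0])
    pivots = []
    r = 0  # index of the next pivot row
    for c in range(n_cols):
        if r == n_rows:
            break  # every row is used up: no further column can be independent
        # Choose the remaining row with the largest entry in column c.
        best = max(range(r, n_rows), key=lambda i: abs(rows[i][c]))
        if abs(rows[best][c]) < tol:
            continue  # column c depends on earlier columns
        rows[r], rows[best] = rows[best], rows[r]
        pivot_row = rows[r]
        # Eliminate column c from all rows below the pivot row.
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / pivot_row[c]
            if factor != 0.0:
                for j in range(c, n_cols):
                    rows[i][j] -= factor * pivot_row[j]
        pivots.append(c)
        r += 1
    return pivots


random.seed(0)
n_samples, n_genes = 50, 100  # sizes taken from the post; the data are simulated

# Design matrix: an intercept column followed by one column per gene.
design = [
    [1.0] + [random.gauss(0.0, 1.0) for _ in range(n_genes)]
    for _ in range(n_samples)
]

estimable = pivot_columns(design)
n_estimable = len(estimable)               # intercept + 49 genes
n_aliased = (n_genes + 1) - n_estimable    # coefficients that would be NA
residual_df = n_samples - n_estimable      # nothing left to estimate error

print(n_estimable, n_aliased, residual_df)  # prints: 50 51 0
```

With 50 samples the design matrix can have rank at most 50, so only 50 coefficients (here the intercept and the first 49 genes) can be determined. The remaining columns are exact linear combinations of those, and the fitted model can reproduce the training data almost perfectly, which is consistent with a residual deviance near 10^-27.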