Dear All,

This is not strictly R related (though I would implement the solution in R; besides, this list being so helpful for these kinds of stats questions...).

I got a "strange" request from a colleague. He has a bunch (approx. 25000) of subjects that belong to one of 12 possible classes. In addition, there are 8 covariates (factors) that can take as values either "absence" or "presence". Some of the subjects only have one covariate with value "presence" (the other covariates being absent), but many of the subjects have more than one covariate with a value of "presence".

My colleague wants each subject with more than one "presence" covariate to have one, and only one, covariate be "presence" (of course, the final "present" covariate would belong to the original "present" covars. for that subject): in other words, each subject would be characterized by only one covariate. This "selection of covariates for each subject" (or eliminating covariates for each subject) has to be done in a way that maximizes the correct classification of class based on the presence/absence of covariates.

(His reason for doing this is that this simplifies further analyses and decision-making; I tried to explain that with 12 classes and 8 covariates where each subject only has one "presence" covar we would not be able to do a great job predicting class membership, but he insists the one-covar-per-subject is essential).

I thought about a couple of approaches (see details below) but none seem very satisfactory. This issue keeps reminding me of things such as the LASSO and other shrinkage methods, but the twist here is that it is not the beta for a covariate, but different covars in each subject which are made zero.

Is there any obvious solution I am missing? Any suggestions?

Thanks,

************
Approach 1: the final statistic to judge predictive quality is Goodman & Kruskal's tau (or concentration coefficient) for IxJ contingency tables.
Since for every subject with m "present" covars, there are m possible contingency tables, and there are many subjects with multiple present covars, there is an astronomical number of possible contingency tables, and we cannot do an exhaustive search (nor do I see an obvious way to simplify the problem from tau's definition, because we have 12 categories to predict based on the 8 covars). I would use a genetic algorithm to try and find a decent solution.

Approach 2: set this up as a multinomial loglinear model. Fit it (using multinom) to the original data set. Do not treat the covars as factors but code present as 1 and absent as 0. For each subject with several (say, k) "present" covars, predict the class membership (predict.multinom) for each of the k covar. vectors obtained after subtracting, say, 0.1, from each of the covariates (except one) with non-zero value. Set as the new covariate vector for that subject the one that gives the highest predicted probability to the right class. Repeat the model fitting and modify covariates as in the last step (rescaling at the end, so that the max. covar. value is always one for each subject) until there is only one non-zero covar. (If there ever is!). This seems to me like a very clumsy approach, and I am not sure if there is any reason for it to arrive at a reasonable solution; I thought it could be a way of smoothly moving, within subject, each covariate (except one) "along its path of least resistance" to a value of zero.

(Note: in both approaches further simplification can be achieved by applying the same transformation or mutation ---with ga--- to all subjects that belong to the same class and have the same initial configuration of covariates. This way I also forcefully prevent identical subjects from ending up with different final configurations).
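To make Approach 1 concrete, here is a minimal sketch of how a single candidate assignment (one covariate kept per subject) might be scored with Goodman & Kruskal's tau. Everything in it is an assumption for illustration: the data are simulated, the names `gk_tau`, `pick_one`, `x`, `cls` and `chosen` are hypothetical, and the random choice of covariate merely stands in for one individual that a genetic algorithm (or any other search) would propose.

```r
# Goodman & Kruskal's tau for predicting the column variable (class)
# from the row variable (the single covariate kept for each subject).
gk_tau <- function(tab) {
  p <- tab / sum(tab)                 # joint proportions
  row_p <- rowSums(p)
  col_p <- colSums(p)
  keep <- row_p > 0                   # skip empty rows to avoid 0/0
  explained <- sum(p[keep, , drop = FALSE]^2 / row_p[keep])
  baseline <- sum(col_p^2)
  (explained - baseline) / (1 - baseline)
}

# Simulated stand-in data (NOT the real data set): 8 binary covariates,
# 12 classes, every subject with at least one "presence".
set.seed(1)
n <- 200; n_cov <- 8; n_class <- 12
x <- matrix(rbinom(n * n_cov, 1, 0.3), nrow = n, ncol = n_cov)
x[rowSums(x) == 0, 1] <- 1
cls <- sample(n_class, n, replace = TRUE)

# One candidate solution: for each subject keep one of its present
# covariates, here chosen at random.
pick_one <- function(present_row) {
  idx <- which(present_row == 1)
  idx[sample.int(length(idx), 1)]     # safe even when length(idx) == 1
}
chosen <- apply(x, 1, pick_one)

# Score the candidate: 8 x 12 table of kept covariate by class.
tab <- table(covariate = factor(chosen, levels = seq_len(n_cov)),
             class = factor(cls, levels = seq_len(n_class)))
tau <- gk_tau(tab)
print(tau)
```

A search procedure would call `gk_tau` repeatedly on the tables produced by different `chosen` vectors and keep the best one; the sketch only shows the scoring step, not the search itself.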
--
Ramón Díaz-Uriarte
Unidad de Bioinformática
Centro Nacional de Investigaciones Oncológicas (CNIO)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
http://bioinfo.cnio.es

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Martyn Plummer
2002-Jun-19 09:32 UTC
[R] best selection of covariates (for each individual)
On 19-Jun-2002 Ramon Diaz wrote:
> Dear All,
>
> This is not strictly R related (though I would implement the solution in R;
> besides, this list being so helpful for these kinds of stats questions...).
>
> I got a "strange" request from a colleague. He has a bunch (approx. 25000) of
> subjects that belong to one of 12 possible classes. In addition, there are 8
> covariates (factors) that can take as values either "absence" or "presence".
> Some of the subjects only have one covariate with value "presence" (the other
> covariates being absent), but many of the subjects have more than one
> covariate with a value of "presence".
> My colleague wants each subject with more than one "presence" covariate to
> have one, and only one, covariate be "presence" (of course, the final
> "present" covariate would belong to the original "present" covars. for that
> subject): in other words, each subject would be characterized by only one
> covariate. This "selection of covariates for each subject" (or eliminating
> covariates for each subject) has to be done in a way that maximizes the
> correct classification of class based on the presence/absence of covariates.
>
> (His reason for doing this is that this simplifies further analyses and
> decision-making; I tried to explain that with 12 classes and 8 covariates
> where each subject only has one "presence" covar we would not be able to do a
> great job predicting class membership, but he insists the
> one-covar-per-subject is essential).

Well, don't help him any more then! One of the perils of being an applied statistician is that one's colleagues may (and often do) suggest ad hoc solutions that make the analysis more complicated. In this case, your colleague seems to have turned what appears to be a straightforward problem into an insoluble one. If he won't listen to reason, I don't see why you should be obliged to work around him.
I know this doesn't answer your question, but if the problem is to get the best classification of the subjects into 12 classes (where the class of each subject is known) based on 8 binary covariates, then it can be handled easily by a classification tree. Trying to persuade your colleague to use this approach will be more fruitful than working around his prejudices, especially in the long run.

I'm feeling grumpy today (as you can probably tell). My apologies.

Martyn

> I thought about a couple of approaches (see details below) but none seem very
> satisfactory. This issue keeps reminding me of things such as the LASSO and
> other shrinkage methods, but the twist here is that it is not the beta for a
> covariate, but different covars in each subject which are made zero.
>
> Is there any obvious solution I am missing? Any suggestions?
>
> Thanks,
>
> ************
> Approach 1: the final statistic to judge predictive quality is Goodman &
> Kruskal's tau (or concentration coefficient) for IxJ contingency tables.
> Since for every subject with m "present" covars, there are m possible
> contingency tables, and there are many subjects with multiple present covars,
> there is an astronomical number of possible contingency tables, and we
> cannot do an exhaustive search (nor do I see an obvious way to simplify the
> problem from tau's definition, because we have 12 categories to predict based
> on the 8 covars). I would use a genetic algorithm to try and find a decent
> solution.
>
>
> Approach 2: set this up as a multinomial loglinear model. Fit it (using
> multinom) to the original data set. Do not treat the covars as factors but
> code present as 1 and absent as 0.
> For each subject with several (say, k) "present" covars, predict the class
> membership (predict.multinom) for each of the k covar. vectors obtained after
> subtracting, say, 0.1, from each of the covariates (except one) with
> non-zero value.
> Set as the new covariate vector for that subject the one that gives
> the highest predicted probability to the right class.
> Repeat the model fitting and modify covariates as in the last step
> (rescaling at the end, so that the max. covar. value is always one for each
> subject) until there is only one non-zero covar. (If there ever is!).
> This seems to me like a very clumsy approach, and I am not sure if there is
> any reason for it to arrive at a reasonable solution; I thought it could be a
> way of smoothly moving, within subject, each covariate (except one) "along
> its path of least resistance" to a value of zero.
>
> (Note: in both approaches further simplification can be achieved by applying
> the same transformation or mutation ---with ga--- to all subjects that belong
> to the same class and have the same initial configuration of covariates. This
> way I also forcefully prevent identical subjects from ending up with
> different final configurations).
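
For reference, the classification tree suggested in the reply could be tried along the following lines. This is only a sketch under stated assumptions: it uses the recommended package rpart (the reply does not name a package), the data are simulated rather than the real 25000 subjects, and the names `dat`, `covars`, `signal`, `noisy`, `fit`, `pred` and `acc` are hypothetical.

```r
library(rpart)  # recommended package; assumed to be installed

# Simulated stand-in data (NOT the real data set): 8 binary covariates
# and a 12-level class that depends weakly on the first three covariates.
set.seed(42)
n <- 500
x <- matrix(rbinom(n * 8, 1, 0.3), nrow = n, ncol = 8)
signal <- 1 + x[, 1] + 2 * x[, 2] + 4 * x[, 3]          # values in 1..8
noisy <- ifelse(runif(n) < 0.7, signal, sample(12, n, replace = TRUE))

covars <- as.data.frame(x)
names(covars) <- paste0("cov", 1:8)
covars[] <- lapply(covars, factor, levels = c(0, 1),
                   labels = c("absence", "presence"))
dat <- data.frame(class = factor(noisy, levels = 1:12), covars)

# Fit a classification tree using all 8 covariates at once, i.e. without
# forcing each subject down to a single "presence".
fit <- rpart(class ~ ., data = dat, method = "class")

# Resubstitution accuracy; optimistic, so a real analysis would use
# cross-validation or a hold-out set instead.
pred <- predict(fit, newdata = dat, type = "class")
acc <- mean(as.character(pred) == as.character(dat$class))
print(fit)
print(acc)
```

Each leaf of the fitted tree corresponds to a combination of presences and absences, which may be a more natural simplification for decision-making than one covariate per subject.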