Dear all,

I searched for some classification methods and I have no clue whether I picked the right ones.

My problem: I have a matrix with 17000 rows and 33 columns (genes and patients). The patients are grouped into 3 diseases. Now I want to classify the patients, and of course I want to know which rows are more helpful for the classification than others.

I tried SVM and random forest. Do you think these are the right classification methods? Maybe there are some hints you can give me. I am more familiar with the Bioconductor packages. Furthermore: this was not my field of study in the past, but I want to understand it and I am willing to work my way into it.
It would be amazing if one of the (more) mathematical people could give me a hint.

Thanks and all the best

Peter

PS: I can upload my underlying data if somebody is interested.
Dear Max,

first: thanks a lot for your suggestion and the frank words about methods in real life. I guess that is exactly my problem. Regarding my analysis: yes, I am coerced to do this analysis, because there is no time to start something else or to look into other methods.

So you suggest Linear Discriminant Analysis. Is there a particular package you recommend?
I tried Nearest Shrunken Centroids with the package PAMR (http://www-stat.stanford.edu/~tibs/PAM/Rdist/doc/readme.html). The example works fine, but I guess I have too many rows (in this case genes) for the analysis.

My main problem is that I cannot reduce the number of genes, because some of the bosses want to compare the output of the classification methods with a rule-based algorithm that works with all genes on the array (after P/A calls and an alternative CDF). So a reduction of the 17 000 genes is only possible to a limited extent (around 7000 genes after some pre-processing steps).

I am more than happy about any tips and suggestions.

Best
Peter

On 19.11.2012 at 16:36, Max Kuhn <mxkuhn at gmail.com> wrote:

> My suggestion is not to do any predictive modeling. Basically, the
> data doesn't support a sensible and reproducible model. Yes, the
> literature is saturated with this type of analysis but almost none of
> the examples have any utility in real life.
>
> Stick to differential expression analysis, investigate the results
> statistically and biologically then design a prospective experiment
> with a specific set of genes and a more refined measurement system.
>
> If you are doing this analysis to learn something from the data (as
> opposed to generating accurate predictions), a predictive model is one
> of the worst ways of going about it.
>
> If you are coerced to do this analysis, stick to linear methods
> (regularized LDA, nearest shrunken centroids, etc) that are less
> likely to over-fit and bias yourself towards those that have embedded
> feature selection.
>
> Max
>
>
> On Mon, Nov 19, 2012 at 10:16 AM, Peter Kupfer <peter.kupfer at me.com> wrote:
>> [...]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
>
> Max
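The worry above that 17 000 genes are too many rows for nearest shrunken centroids can be checked on simulated data of the same shape. The following is only a minimal sketch, not the poster's analysis: the expression matrix, the class labels "A"/"B"/"C" and the 40 artificially shifted genes are all assumptions made for illustration, and choosing the largest threshold with minimal cross-validated error is one common heuristic rather than a rule taken from the thread.

```r
# Sketch: nearest shrunken centroids (pamr) on simulated data with the
# dimensions mentioned in the thread (17000 genes x 33 patients, 3 diseases).
# All data below are simulated; nothing here comes from the real study.
library(pamr)

set.seed(1)
n_genes    <- 17000
n_patients <- 33
disease    <- factor(rep(c("A", "B", "C"), each = 11))  # hypothetical labels

x <- matrix(rnorm(n_genes * n_patients), nrow = n_genes)
rownames(x) <- paste0("gene", seq_len(n_genes))
# Make a few genes informative so the classes are separable (illustrative only).
x[1:20,  disease == "B"] <- x[1:20,  disease == "B"] + 2
x[21:40, disease == "C"] <- x[21:40, disease == "C"] + 2

# pamr expects genes in rows and samples in columns.
dat <- list(x = x, y = disease,
            geneid = rownames(x), genenames = rownames(x))

fit   <- pamr.train(dat)               # fits a whole path of thresholds
cvfit <- pamr.cv(fit, dat, nfold = 5)  # cross-validated error per threshold

# Heuristic: largest threshold (fewest genes) among those with minimal CV error.
best <- max(cvfit$threshold[cvfit$error == min(cvfit$error)])

# Indices of the genes whose shrunken centroids are non-zero at that threshold;
# these are the rows the classifier actually uses.
keep <- pamr.predict(fit, dat$x, threshold = best, type = "nonzero")
head(rownames(x)[keep])
```

Because the shrinkage threshold does the gene selection internally, the full matrix can be passed in without pre-filtering. With only 33 patients the cross-validated error is itself very noisy, so the selected gene list should be read with the caution Max describes.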
Dear Peter,

There are several packages that try to address this type of problem (although the remarks made by Max are something that we should always keep in mind), and I also recommend those which perform some form of regularized, penalized or shrunken linear discriminant analysis with a preliminary variable selection step.

You can take a look at the hda, rda, sda, SDDA and HDclassif packages, or my own HiDimDA package, for some of the most important alternatives.

Hope this helps.

Best,

Pedro

Pedro Duarte Silva
Associate Professor of Statistics and Operations Research
Faculdade de Economia e Gestão
Universidade Católica Portuguesa / Porto
www.feg.porto.ucp.pt

Date: Mon, 19 Nov 2012 20:53:10 +0100
From: Peter Kupfer <peter.kupfer at me.com>
To: Max Kuhn <mxkuhn at gmail.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Subject: Re: [R] Classification methods - which one?
Message-ID: <ED56664A-E8EF-4733-A12B-35117F347CC6 at me.com>

[...]
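A shrinkage discriminant analysis with a preliminary variable selection step, as recommended above, could look roughly like the following with the sda package. This is a hedged sketch on simulated data: the matrix size (2000 genes instead of the roughly 7000 mentioned in the thread, to keep the run short), the class labels, the shifted genes and the cut-off of 50 top-ranked genes are all assumptions. The point it tries to illustrate is that the gene ranking is recomputed inside every cross-validation fold, so the held-out patient never influences which genes are selected.

```r
# Sketch: shrinkage LDA (sda) with gene ranking done inside leave-one-out CV.
# Simulated data only; sizes and the top-50 cut-off are illustrative assumptions.
library(sda)

set.seed(2)
n_patients <- 33
n_genes    <- 2000
disease    <- factor(rep(c("A", "B", "C"), each = 11))  # hypothetical labels

# sda expects samples in rows and genes in columns.
X <- matrix(rnorm(n_patients * n_genes), nrow = n_patients)
colnames(X) <- paste0("gene", seq_len(n_genes))
X[disease == "B", 1:20]  <- X[disease == "B", 1:20]  + 2
X[disease == "C", 21:40] <- X[disease == "C", 21:40] + 2

n_top <- 50
pred  <- character(n_patients)

for (i in seq_len(n_patients)) {
  train <- setdiff(seq_len(n_patients), i)

  # Rank genes using the training patients only.
  rk  <- sda.ranking(X[train, ], disease[train],
                     diagonal = FALSE, fdr = FALSE, verbose = FALSE)
  top <- rk[seq_len(n_top), "idx"]

  # Fit shrinkage LDA on the selected genes and predict the held-out patient.
  fit <- sda(X[train, top], disease[train], diagonal = FALSE, verbose = FALSE)
  pred[i] <- as.character(
    predict(fit, X[i, top, drop = FALSE], verbose = FALSE)$class
  )
}

# Cross-validated confusion table.
table(truth = disease, predicted = pred)
```

Ranking the genes once on all 33 patients and cross-validating only the classifier would give an optimistically biased error estimate, which is why the selection step sits inside the loop here.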