Dear Peter,
There are several packages that try to address this type of problem (although
the remarks made by Max are something that we should always keep in mind), and
I also recommend those that perform some form of regularized, penalized or
shrunken linear discriminant analysis with a preliminary variable selection
step.
You can take a look at the hda, rda, sda, SDDA, HDclassif or my own HiDimDA
packages for some of the most important alternatives.
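
To make the idea of shrinkage with built-in variable selection a bit more
concrete, here is a minimal sketch of nearest shrunken centroids, the method
Max mentions and that you tried with PAMR. It is written in plain Python and
does not use any of the packages above. The function names fit_nsc and
predict_nsc, the threshold values and the simulated data are my own
assumptions for illustration only. The sketch assumes at least two classes
and genes that are not all constant.

```python
import math
import random
from statistics import median


def fit_nsc(X, y, delta):
    """Fit nearest shrunken centroids (illustrative sketch, hypothetical API).

    X: list of samples (patients), each a list of gene values.
    y: list of class labels, one per sample.
    delta: soft-threshold amount; larger values drop more genes.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    members = {k: [i for i in range(n) if y[i] == k] for k in classes}
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    centroid = {k: [sum(X[i][j] for i in members[k]) / len(members[k])
                    for j in range(p)] for k in classes}

    # Pooled within-class standard deviation of each gene.
    s = [math.sqrt(sum((X[i][j] - centroid[y[i]][j]) ** 2 for i in range(n))
                   / (n - len(classes))) for j in range(p)]
    s0 = median(s)  # offset that damps genes with very small variance

    shrunken, active = {}, set()
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        shrunken[k] = []
        for j in range(p):
            scale = m_k * (s[j] + s0)
            d = (centroid[k][j] - overall[j]) / scale
            # Soft thresholding: move d towards zero by delta, stop at zero.
            d_shrunk = math.copysign(max(abs(d) - delta, 0.0), d)
            if d_shrunk != 0.0:
                active.add(j)  # gene still separates at least one class
            shrunken[k].append(overall[j] + scale * d_shrunk)

    return {
        "classes": classes,
        "centroids": shrunken,
        "selected": sorted(active),
        "scale": [sj + s0 for sj in s],
        "log_prior": {k: math.log(len(members[k]) / n) for k in classes},
    }


def predict_nsc(model, x):
    """Assign x to the class with the closest shrunken centroid."""
    best, best_score = None, math.inf
    for k in model["classes"]:
        # Genes that were shrunk away contribute equally to every class,
        # so only the selected genes need to be summed.
        score = sum(((x[j] - model["centroids"][k][j]) / model["scale"][j]) ** 2
                    for j in model["selected"]) - 2.0 * model["log_prior"][k]
        if score < best_score:
            best, best_score = k, score
    return best


if __name__ == "__main__":
    # Simulated data (an assumption): 33 patients, 3 classes, 200 genes,
    # of which only genes 0, 1 and 2 carry any class signal.
    random.seed(1)
    labels = ["A", "B", "C"]
    y = [labels[i % 3] for i in range(33)]
    X = []
    for label in y:
        row = [random.gauss(0.0, 1.0) for _ in range(200)]
        row[labels.index(label)] += 3.0
        X.append(row)
    for delta in (0.0, 1.0, 2.0, 4.0):
        model = fit_nsc(X, y, delta)
        print(delta, len(model["selected"]))
```

The threshold delta controls how many genes survive, so all genes can enter
the analysis and the method itself decides which ones to keep. In practice
delta should be chosen by cross-validation, and with only 33 patients the
resulting error estimates will still be very unstable, which is exactly the
point Max makes.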
Hope this helps.
Best,
Pedro
Pedro Duarte Silva
Associate Professor of Statistics and Operations Research
Faculdade de Economia e Gestão
Universidade Católica Portuguesa / Porto
www.feg.porto.ucp.pt
Date: Mon, 19 Nov 2012 20:53:10 +0100
From: Peter Kupfer <peter.kupfer at me.com>
To: Max Kuhn <mxkuhn at gmail.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Subject: Re: [R] Classification methods - which one?
Dear Max,
First of all, thanks a lot for your suggestion and for the frank words about
how these methods fare in real life. I guess that is exactly my problem.
Regarding my analysis: yes, that is the problem, and I am being coerced into
doing this analysis because there is no time to start on something else or to
try other methods.
So you suggest Linear Discriminant Analysis. Is there a particular package you
recommend? I tried Nearest Shrunken Centroids with the package PAMR
(http://www-stat.stanford.edu/~tibs/PAM/Rdist/doc/readme.html).
The example works fine, but I guess I have too many rows (genes, in this case)
for the analysis. My main problem is that I cannot reduce the number of genes,
because some of the bosses want to compare the output of the classification
methods with a rule-based algorithm that works with all genes on the array
(after P/A calls and an alternative CDF). So reducing the 17,000 genes is only
possible to a limited extent (to around 7,000 genes after some pre-processing
steps).
I would be very grateful for any tips and suggestions.
Best
Peter
On 19.11.2012 at 16:36, Max Kuhn <mxkuhn at gmail.com> wrote:
> My suggestion is not to do any predictive modeling. Basically, the
> data doesn't support a sensible and reproducible model. Yes, the
> literature is saturated with this type of analysis but almost none of
> the examples have any utility in real life.
>
> Stick to differential expression analysis, investigate the results
> statistically and biologically then design a prospective experiment
> with a specific set of genes and a more refined measurement system.
>
> If you are doing this analysis to learn something from the data (as
> opposed to generating accurate predictions), a predictive model is one
> of the worst ways of going about it.
>
> If you are coerced to do this analysis, stick to linear methods
> (regularized LDA, nearest shrunken centroids, etc) that are less
> likely to over-fit and bias yourself towards those that have embedded
> feature selection.
>
> Max
>
>
> On Mon, Nov 19, 2012 at 10:16 AM, Peter Kupfer <peter.kupfer at me.com> wrote:
>> Dear all,
>> I searched for some classification methods and I have no clue whether I
>> picked the right ones.
>> My problem: I have a matrix with 17000 rows and 33 columns (genes and
>> patients). The patients are grouped into 3 diseases.
>> Now I want to classify the patients, and of course I also want to know
>> which rows are more helpful for the classification than others.
>>
>> I tried SVM and random forest. Do you think these are the right
>> classification methods? Maybe there are some hints you can give me. I am
>> more familiar with the Bioconductor packages. Furthermore, this has not
>> been my field of study in the past, but I want to understand it and I am
>> willing to work my way into it.
>> It would be great if one of the (more) mathematical people could give me
>> a hint.
>> Thanks and all the best
>>
>> Peter
>>
>>
>> PS: I can upload my underlying data if somebody is interested