thr3ads.net - R help - [R] Newbie clustering/classification question [Mar 2006]

Mark A. Miller

2006-Mar-26 03:27 UTC

[R] Newbie clustering/classification question

My laboratory is measuring the abundance of various proteins in the
blood from either healthy individuals or from individuals with various
diseases.  I would like to determine which proteins, if any, have
significantly different abundances between the healthy and diseased
individuals.  Currently, one of my colleagues is performing an ANOVA on
each protein with MS Excel.  I would like to analyze the data sets with
a scriptable tool, like R.  I could use another tool, but I am trying
to stick to open source.  I have basic procedural programming skills (I
do a lot of PHP/MySQL), but I'm not very good with anything that
requires thinking in vectors and matrices. 
	One approach I'm imagining is looping through all of the columns and
doing an ANOVA, like my colleague is doing manually.  I have heard
other people in my field talking about other tests for this kind of
data.  Would a Kruskal-Wallis test, hierarchical data clustering,
principal component analysis, or random forests be appropriate for the
question I am asking?  If so, how would I write a reusable script for
the test?  The data table will always have the same basic structure,
but the number of proteins could vary, as could the number of
conditions or the number of repeats within each condition.
	I especially want to export the results of this test in a format
roughly like the example below.  (I'd like the mean of each protein's
abundance for each condition, some measure of variability within each
condition, and a measure of significance for whether the protein
abundances are different between conditions.)  I have gotten to the
point of doing an ANOVA on a single protein in R and viewing the results
interactively, but I have no idea how to analyze the differences for
all of the proteins (in a loop, or all at once) or how to save the
results to a file.
Any suggestions?

Example input (tab delimited)
condition	protA	protB	protC	protD	protE	protF	protG	protH
healthy1	11111	22222	33333	70681	61735	66666	77777	88888
healthy1	12121	21111	32132	57230	69715	67890	87878	98989
healthy1	10101	20202	30303	67223	51967	65656	78900	111111
healthy2	12345	23111	32100	65931	67650	60001	80001	101010
healthy2	13333	21231	34111	58761	54086	60002	80002	122222
healthy2	13232	20101	30009	68752	70360	60003	80003	91919
asthma	32132	19889	30733	59959	71783	60237	65603	20374
asthma	34344	20483	31182	70531	59630	40445	56370	98404
asthma	39999	20464	29793	58395	66976	50577	39908	65367
diabetes	10000	20102	29486	51260	68447	42960	50875	216227
diabetes	10111	19143	31275	52573	55459	71337	53090	151505
diabetes	10001	21790	31470	54222	57318	64058	44166	207427
diabetes	15555	20123	30131	59882	71191	46203	44633	197430
acne	12222	31221	51381	64431	55016	43463	60388	74243
acne	12221	30535	49199	61419	65096	71551	41811	104317
acne	10001	30649	49199	56731	69871	61816	44321	125068


Desired output
condition	protA	protB	protC	protD	protE	protF	protG	protH
healthy1.mean								
healthy1.sd								
healthy1.pval								
healthy2.mean								
healthy2.sd								
healthy2.pval								
asthma.mean								
asthma.sd								
asthma.pval								
diabetes.mean								
diabetes.sd								
diabetes.pval								
acne.mean								
acne.sd								
acne.pval								
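
A minimal base-R sketch of how a table roughly like the one above could be produced, under some assumptions: the first column is named `condition`, every other column is a protein, and a one-way ANOVA (with Kruskal-Wallis as a rank-based alternative) is an acceptable test. Note that these tests give one p-value per protein across all conditions, not one per condition as in the desired layout. The output file name `protein_summary.txt` is hypothetical, and only three of the example proteins are included to keep the block short.

```r
# Inline copy of three proteins from the example input. For a real file,
# something like read.delim("proteins.txt") (hypothetical name) would replace this.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
condition protA protB protH
healthy1  11111 22222 88888
healthy1  12121 21111 98989
healthy1  10101 20202 111111
healthy2  12345 23111 101010
healthy2  13333 21231 122222
healthy2  13232 20101 91919
asthma    32132 19889 20374
asthma    34344 20483 98404
asthma    39999 20464 65367
diabetes  10000 20102 216227
diabetes  10111 19143 151505
diabetes  10001 21790 207427
diabetes  15555 20123 197430
acne      12222 31221 74243
acne      12221 30535 104317
acne      10001 30649 125068
")

# Keep the conditions in the order they appear in the file.
dat$condition <- factor(dat$condition, levels = unique(dat$condition))
proteins <- setdiff(names(dat), "condition")

# Per-condition mean and SD for every protein (rows = conditions).
means <- sapply(dat[proteins], function(x) tapply(x, dat$condition, mean))
sds   <- sapply(dat[proteins], function(x) tapply(x, dat$condition, sd))

# One p-value per protein from a one-way ANOVA across all conditions.
anova_p <- sapply(proteins, function(p) {
  fit <- aov(dat[[p]] ~ dat$condition)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Rank-based alternative: Kruskal-Wallis test per protein.
kw_p <- sapply(proteins, function(p) {
  kruskal.test(dat[[p]], dat$condition)$p.value
})

# Assemble one table: means, SDs, then the per-protein p-values.
rownames(means) <- paste0(rownames(means), ".mean")
rownames(sds)   <- paste0(rownames(sds), ".sd")
out <- rbind(means, sds,
             anova.pval   = anova_p,
             kruskal.pval = kw_p,
             anova.padj   = p.adjust(anova_p, method = "BH"))

# Write a tab-delimited file; col.names = NA leaves a blank header cell
# above the row names so the columns line up in a spreadsheet.
out_file <- file.path(tempdir(), "protein_summary.txt")
write.table(out, file = out_file, sep = "\t", quote = FALSE, col.names = NA)
```

Because the code works from the column names and the factor levels, it should not need changes when the number of proteins, conditions, or repeats varies. The `anova.padj` row applies a Benjamini-Hochberg adjustment, which is one common choice when many proteins are tested at once.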



---   ---   ---   ---   ---   ---   ---   ---

Mark A. Miller

Sean Davis

2006-Mar-26 11:48 UTC

[R] Newbie clustering/classification question

Mark A. Miller wrote:
> [snip]

Hi, Mark.  With data like these, you will want to look at the
Bioconductor (http://www.bioconductor.org) project.  If you transpose
your matrix so that individuals are in columns and proteins are in rows,
then you have data in exactly the same form as a microarray analysis, so
most of the tools in Bioconductor will apply.  In addition, there are
tools specifically designed for mass-spec data.  For your question
directly, look at the limma package; it will do a protein-by-protein
ANOVA for you.  There is an extensive user guide available.
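
As a rough sketch of the transposition and the limma route described above (not a definitive workflow): the block below assumes the limma package from Bioconductor is installed and skips the model fit otherwise. The log2 transform, the design with an intercept, and testing all non-intercept coefficients together are illustrative choices, not something stated in this thread.

```r
# Example input from the original post (whitespace-separated here).
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
condition protA protB protC protD protE protF protG protH
healthy1  11111 22222 33333 70681 61735 66666 77777 88888
healthy1  12121 21111 32132 57230 69715 67890 87878 98989
healthy1  10101 20202 30303 67223 51967 65656 78900 111111
healthy2  12345 23111 32100 65931 67650 60001 80001 101010
healthy2  13333 21231 34111 58761 54086 60002 80002 122222
healthy2  13232 20101 30009 68752 70360 60003 80003 91919
asthma    32132 19889 30733 59959 71783 60237 65603 20374
asthma    34344 20483 31182 70531 59630 40445 56370 98404
asthma    39999 20464 29793 58395 66976 50577 39908 65367
diabetes  10000 20102 29486 51260 68447 42960 50875 216227
diabetes  10111 19143 31275 52573 55459 71337 53090 151505
diabetes  10001 21790 31470 54222 57318 64058 44166 207427
diabetes  15555 20123 30131 59882 71191 46203 44633 197430
acne      12222 31221 51381 64431 55016 43463 60388 74243
acne      12221 30535 49199 61419 65096 71551 41811 104317
acne      10001 30649 49199 56731 69871 61816 44321 125068
")

# Transpose: proteins in rows, individuals (samples) in columns.
# Assumes the first column holds the condition labels.
expr <- t(as.matrix(dat[, -1]))
colnames(expr) <- make.unique(as.character(dat$condition))

# Design matrix with an intercept: the first factor level is the reference.
condition <- factor(dat$condition)
design <- model.matrix(~ condition)

if (requireNamespace("limma", quietly = TRUE)) {
  # Fit one linear model per protein, then moderate the variances.
  fit <- limma::eBayes(limma::lmFit(log2(expr), design))
  # Testing all non-intercept coefficients together gives an F-test per
  # protein for any difference among conditions (ANOVA-like).
  res <- limma::topTable(fit, coef = 2:ncol(design),
                         number = Inf, sort.by = "none")
  print(res[, c("F", "P.Value", "adj.P.Val")])
} else {
  message("limma is not installed; skipping the model fit.")
}
```

With only a handful of proteins, the variance moderation in `eBayes` has little information to borrow across rows, so the results may be close to an ordinary per-protein ANOVA; the approach is aimed at data sets with many proteins.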

Sean
