Thanks for the advice. My question is more about how to do this.

Let me use a gene expression example from biology to illustrate. There are always some housekeeping genes that differ little even under pathological conditions.

We know that external factors affect the measurements differently from batch to batch. For example, overall signal intensity might differ because of lab reagents.

A simplified picture:
Day 1: Using control samples, I measured genes #1 to #110.
Day 2: Using disease samples, I measured genes #1 to #110 again.

Comparing the two data sets, I noticed that the signal intensity of each gene is higher on Day 1 than on Day 2. I know from the biological literature that genes 101 to 110 are housekeeping genes and should not change much between disease and control. My question is, technically, how do I use the values of genes 101 to 110 to adjust the signals of genes 1 to 100 so that the batch effect is corrected? The differences revealed by comparing genes 1 to 100 between disease and control should then be due to biology rather than lab artifacts.

So the question is how to do that mathematically. If I had only one housekeeping gene, I could divide every gene by it to normalize, and then compare. But now I have 10 genes that can be used for normalization. I assume that, in this context, the more reference genes are used, the better.

Can you help again?

Thanks much in advance.

Waverley wrote:
> Hi,
>
> I have a question about how to normalize data sets according to a set
> of internal measurements.
>
> For example, I have performed two batches of experiments contrasting
> two different conditions (positive versus negative), one at a time.
>
> 1. In each experiment, I measure the signals of variables v1 to v100.
> I want to understand how v1 to v100 change under these two contrasting
> conditions.
>
> 2. I also know of 10 other variables, v101 to v110. Although they
> differ from each other, each should have the same or a similar value
> under the two contrasting conditions.
>
> 3. How do I do the internal normalization? How can I use the values of
> v101 to v110 to normalize the measurements of v1 to v100 under either
> condition to minimize the batch effect? I hope the comparisons of v1 to
> v100 between the two conditions can then be more accurate and more
> robust to external noise.
>
> In general, I have a couple of matrices of the same dimensions and a
> matrix of reference values to normalize to. How should I do that?

I don't understand your problem well, but in general internal normalization is by and large an attempt to avoid appropriate modeling (e.g., incorporating block effects or certain covariates in a regression model), and results in overstated confidence of the final estimates by not taking into account the imprecision in the normalizing factors.

Frank

--
Frank E Harrell Jr
Professor and Chair, School of Medicine
Department of Biostatistics, Vanderbilt University

--
Waverley @ Palo Alto
You would do better to ask this question on the Bioconductor mailing list. For qPCR normalisation strategies, take a look at http://www.gene-quantification.info/

Best,
Matthias

--
Dr. Matthias Kohl
www.stamats.de
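For readers wondering what a normalisation to several reference genes can look like in practice: one widely used choice in the qPCR literature is to scale each batch by the geometric mean of its reference genes. The sketch below is only an illustration in Python with made-up numbers. The function names and the dict-based data layout are hypothetical, and it assumes that the batch effect is a single multiplicative factor shared by all genes and that the reference genes are truly unaffected by disease. With each condition run on its own day, that assumption cannot be checked from the data.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize_to_references(signals, reference_ids):
    """Divide every signal in one batch by the geometric mean of its reference genes.

    signals: dict mapping gene id -> raw intensity (hypothetical layout).
    reference_ids: ids of housekeeping genes ASSUMED unchanged by disease.
    """
    factor = geometric_mean([signals[g] for g in reference_ids])
    return {gene: value / factor for gene, value in signals.items()}


# Toy data (invented): day 2 is day 1 scaled by 0.5, a pure batch effect,
# except that gene 1 is additionally doubled by the disease.
reference_ids = [101, 102, 103]
day1_control = {1: 80.0, 2: 40.0, 101: 10.0, 102: 20.0, 103: 40.0}
day2_disease = {1: 80.0, 2: 20.0, 101: 5.0, 102: 10.0, 103: 20.0}

norm_control = normalize_to_references(day1_control, reference_ids)
norm_disease = normalize_to_references(day2_disease, reference_ids)

# Disease/control ratio per gene after normalisation.
ratios = {g: norm_disease[g] / norm_control[g] for g in (1, 2)}
print(ratios)
```

In this toy case the ratio for gene 1 comes out close to 2 and the ratio for gene 2 close to 1, because the data were constructed to satisfy the assumptions exactly. The scaling factor is itself an estimate, and the sketch does nothing to carry its uncertainty into the final comparison.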
Frank E Harrell Jr
2009-Mar-02 13:04 UTC
[R] How to normalize to a set of internal references
Waverley wrote:
> Day 1: Using control samples, I have measured #1 to #110 genes and get data.
> Day 2: Using disease samples, I have measured again #1 to #110 genes
> and get data.

That is an inappropriate experimental design that has caused major problems in the biomedical research literature (look up the famous Petricoin fiasco - google for petricoin baggerly; Baggerly discovered the error). You have day and disease completely confounded and no model can correct for that (day and disease are completely collinear).

Once you randomize the order of samples to be run and analyzed, you can include day as a blocking factor to adjust for any day effect. If analyzing log intensity, the regression adjustment for day will involve a ratio correction on the original scale. If you are completely correct that the housekeeping genes cannot be disease-related, there is hope for some kind of internal control if you make a strong assumption about the time effect being the same for housekeeping genes as for other genes. But why not just do the proper design?

Frank

--
Frank E Harrell Jr
Professor and Chair, School of Medicine
Department of Biostatistics, Vanderbilt University
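The blocking adjustment described above can be sketched for a single gene as follows. This is a minimal Python illustration with invented, noise-free numbers, assuming a randomized design in which both conditions are run on both days. The small `ols` helper is hypothetical and only there to keep the example self-contained. In R the same fit would normally be written with `lm`, along the lines of `lm(log(intensity) ~ disease + day)`.

```python
import math


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy samples (invented): (disease, day2, intensity). Disease doubles the
# signal and day 2 halves it. Both conditions appear on both days.
samples = [
    (0, 0, 100.0), (0, 0, 100.0), (1, 0, 200.0),
    (0, 1, 50.0), (1, 1, 100.0), (1, 1, 100.0),
]

# Design matrix: intercept, disease indicator, day-2 indicator (the block).
X = [[1.0, float(disease), float(day2)] for disease, day2, _ in samples]
y = [math.log(intensity) for _, _, intensity in samples]

intercept, disease_effect, day_effect = ols(X, y)

# Additive effects on the log scale are ratios on the original scale.
disease_ratio = math.exp(disease_effect)
day_ratio = math.exp(day_effect)
print(disease_ratio, day_ratio)
```

With these numbers the disease ratio comes out close to 2 and the day ratio close to 0.5. If every disease sample were run on day 2 and every control on day 1, the disease and day columns of `X` would be identical, the normal equations would be singular, and the two effects could not be separated.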