Sidoti, Salvatore A.
2016-Nov-14 01:37 UTC
[R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics
Hi Jim,

Nice to see you again! First of all, apologies to all for bending the rules a bit with respect to the mailing list. I know this is a list for R programming specifically, and I have received some great advice in this regard in the past. I just thought this was an interesting applied problem that would generate some discussion about PCA in R.

Yes, that is an excellent question! Indeed, why not just volume? Since this is still a work in progress and we have not yet published, I would rather not be more specific about the type of animal at this time ;>}. Nonetheless, I can say that the animals I study change "size" depending on their feeding and hydration state. The abdomen in particular undergoes drastic size changes. That being said, there are key anatomical features that remain fixed in the adult.

Now, there *might* be a way to work volume into the PCA. Volume is not a reliable metric while the animal is alive, since the abdomen size is so changeable, but what about preserved specimens? I have many that have been marinating in ethanol for months. Wouldn't the tissues have equilibrated by now? Probably... I could measure volume by displacement or suspension, I suppose.

In the meantime, here are a few thoughts:

1) Use the contribution % (known as C% hereafter) of each variable on principal components 1 and 2.

2) The total contribution of a variable that explains the variations retained by PC1 and PC2 is calculated by:

   sum(C%1 * eigenvalue1, C%2 * eigenvalue2)

3) Use scale() to mean-center the columns of the data set (by default it also divides each column by its standard deviation).

4) Use these total contributions as the weights of an arithmetic mean.
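The four steps above can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data frame `toy` is simulated, not the study data, and "contribution" is assumed to mean the squared loading of a variable on a component (the thread does not say how C% was obtained). The last line applies the same weighted mean to the figures quoted in this message.

```r
# Toy data only: four correlated, made-up measurements (not the study data).
set.seed(42)
n <- 30
latent <- rnorm(n)                      # shared "size" signal
toy <- data.frame(
  m1 = latent + rnorm(n, sd = 0.5),
  m2 = latent + rnorm(n, sd = 0.5),
  m3 = latent + rnorm(n, sd = 0.5),
  m4 = latent + rnorm(n, sd = 0.5)
)

pca <- princomp(toy, cor = TRUE)
eig <- pca$sdev^2                       # eigenvalues of the correlation matrix

# Step 1 (assumed definition): contributions as proportions. princomp
# loadings are unit-length eigenvectors, so each column of squared
# loadings sums to 1.
contrib <- unclass(pca$loadings)^2

# Step 2: total contribution over PC1 and PC2, weighted by eigenvalue.
w <- contrib[, 1] * eig[1] + contrib[, 2] * eig[2]

# Step 3: center and standardize the columns.
z <- scale(toy)

# Step 4: weighted arithmetic mean of each animal's standardized values.
size <- as.vector(z %*% w) / sum(w)

# The same weighted mean applied to the figures quoted in this message.
example_size <- weighted.mean(c(1.334, -0.225, 0.046, -0.847),
                              c(0.556, 0.357, 0.493, 0.291))
example_size   # about 0.2579
```

Because the weights are divided by their sum in step 4, any common rescaling of the contributions (percent versus proportion, for instance) leaves the score unchanged.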
For example, we have an animal with the following data (mean-centered):

weight: 1.334
interoc: -0.225
clength: 0.046
cwidth: -0.847

The contributions of these variables on PC1 and PC2 are (% changed to proportions):

weight: 0.556
interoc: 0.357
clength: 0.493
cwidth: 0.291

To calculate size:

1.334(0.556) - 0.225(0.357) + 0.046(0.493) - 0.847(0.291) = 0.43758

Then divide by the sum of the weights:

0.43758 / 1.697 = 0.257855 = "animal size"

This value can then be used to rank the animal according to its size for further analysis...

Does this sound like a reasonable application of my PCA data?

Salvatore A. Sidoti
PhD Student
Behavioral Ecology

-----Original Message-----
From: Jim Lemon [mailto:drjimlemon at gmail.com]
Sent: Sunday, November 13, 2016 3:53 PM
To: Sidoti, Salvatore A. <sidoti.23 at buckeyemail.osu.edu>; r-help mailing list <r-help at r-project.org>
Subject: Re: [R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics

Hi Salvatore,
If by "size" you mean volume, why not directly measure the volume of your animals? They appear to be fairly small. Sometimes working out what the critical value actually means can inform the way to measure it.

Jim

On Sun, Nov 13, 2016 at 4:46 PM, Sidoti, Salvatore A. <sidoti.23 at buckeyemail.osu.edu> wrote:
> Let's say I perform 4 measurements on an animal: three are linear measurements in millimeters and the fourth is its weight in milligrams. So, we have a data set with mixed units.
>
> Based on these four correlated measurements, I would like to obtain one "score" or value that describes an individual animal's size. I considered simply taking the geometric mean of these 4 measurements, and that would give me a "score" - larger values would be for larger animals, etc.
>
> However, this assumes that all 4 of these measurements contribute equally to an animal's size. Of course, more than likely this is not the case.
> I then performed a PCA to discover how much influence each variable had on the overall data set. I was hoping to use this analysis to refine my original approach.
>
> I honestly do not know how to apply the information from the PCA to this particular problem...
>
> I do know, however, that principal components 1 and 2 capture enough of the variation to reduce the number of dimensions down to 2 (see analysis below with the original data set).
>
> Note: animal weights were ln() transformed to increase correlation with the 3 other variables.
>
> df <- data.frame(
>   weight = log(1000*c(0.0980, 0.0622, 0.0600, 0.1098, 0.0538, 0.0701, 0.1138, 0.0540, 0.0629, 0.0930,
>                       0.0443, 0.1115, 0.1157, 0.0734, 0.0616, 0.0640, 0.0480, 0.1339, 0.0547, 0.0844,
>                       0.0431, 0.0472, 0.0752, 0.0604, 0.0713, 0.0658, 0.0538, 0.0585, 0.0645, 0.0529,
>                       0.0448, 0.0574, 0.0577, 0.0514, 0.0758, 0.0424, 0.0997, 0.0758, 0.0649, 0.0465,
>                       0.0748, 0.0540, 0.0819, 0.0732, 0.0725, 0.0730, 0.0777, 0.0630, 0.0466)),
>   interoc = c(0.853, 0.865, 0.811, 0.840, 0.783, 0.868, 0.818, 0.847, 0.838, 0.799,
>               0.737, 0.788, 0.731, 0.777, 0.863, 0.877, 0.814, 0.926, 0.767, 0.746,
>               0.700, 0.768, 0.807, 0.753, 0.809, 0.788, 0.750, 0.815, 0.757, 0.737,
>               0.759, 0.863, 0.747, 0.838, 0.790, 0.676, 0.857, 0.728, 0.743, 0.870,
>               0.787, 0.773, 0.829, 0.785, 0.746, 0.834, 0.829, 0.750, 0.842),
>   cwidth = c(3.152, 3.046, 3.139, 3.181, 3.023, 3.452, 2.803, 3.050, 3.160, 3.186,
>              2.801, 2.862, 3.183, 2.770, 3.207, 3.188, 2.969, 3.033, 2.972, 3.291,
>              2.772, 2.875, 2.978, 3.094, 2.956, 2.966, 2.896, 3.149, 2.813, 2.935,
>              2.839, 3.152, 2.984, 3.037, 2.888, 2.723, 3.342, 2.562, 2.827, 2.909,
>              3.093, 2.990, 3.097, 2.751, 2.877, 2.901, 2.895, 2.721, 2.942),
>   clength = c(3.889, 3.733, 3.762, 4.059, 3.911, 3.822, 3.768, 3.814, 3.721, 3.794,
>               3.483, 3.863, 3.856, 3.457, 3.996, 3.876, 3.642, 3.978, 3.534, 3.967,
>               3.429, 3.518, 3.766, 3.755, 3.706, 3.785, 3.607, 3.922, 3.453, 3.589,
>               3.508, 3.861, 3.706, 3.593, 3.570, 3.341, 3.916, 3.336, 3.504, 3.688,
>               3.735, 3.724, 3.860, 3.405, 3.493, 3.586, 3.545, 3.443, 3.640))
>
> pca_morpho <- princomp(df, cor = TRUE)
>
> summary(pca_morpho)
>
> Importance of components:
>                          Comp.1    Comp.2    Comp.3    Comp.4
> Standard deviation     1.604107 0.8827323 0.7061206 0.3860275
> Proportion of Variance 0.643290 0.1948041 0.1246516 0.0372543
> Cumulative Proportion  0.643290 0.8380941 0.9627457 1.0000000
>
> Loadings:
>         Comp.1 Comp.2 Comp.3 Comp.4
> weight  -0.371  0.907        -0.201
> interoc -0.486 -0.227 -0.840
> cwidth  -0.537 -0.349  0.466 -0.611
> clength -0.582         0.278  0.761
>
>                Comp.1 Comp.2 Comp.3 Comp.4
> SS loadings      1.00   1.00   1.00   1.00
> Proportion Var   0.25   0.25   0.25   0.25
> Cumulative Var   0.25   0.50   0.75   1.00
>
> Any guidance will be greatly appreciated!
>
> Salvatore A. Sidoti
> PhD Student
> The Ohio State University
> Behavioral Ecology
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
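As a quick check on the summary() output quoted in the original post, the "Proportion of Variance" row can be recovered from the standard deviations alone. The sketch below only re-enters the four printed values; the variable names are illustrative.

```r
# Standard deviations copied from the summary() output quoted above.
sdev <- c(Comp.1 = 1.604107, Comp.2 = 0.8827323,
          Comp.3 = 0.7061206, Comp.4 = 0.3860275)

eig <- sdev^2                 # eigenvalues of the correlation matrix
prop_var <- eig / sum(eig)    # "Proportion of Variance" row
cum_prop <- cumsum(prop_var)  # "Cumulative Proportion" row

round(prop_var, 4)
round(cum_prop, 4)
```

With a correlation-matrix PCA on four variables the eigenvalues sum to 4, so the first component on its own accounts for roughly 64% of the total variance and the first two for roughly 84%.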
Jim Lemon
2016-Nov-14 02:04 UTC
[R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics
Hi Salvatore,

Depending upon your concept of "size" the use of the weighted sum may well suit your purpose. The first principal component, being three lengths and a mass, is likely to be strongly related to any sensible concept of "size". My comment was meant to ensure that the local definition of "size" was what you wanted. Using scaled values is a good idea as it provides an intuitive measure of comparison within the population. Remember that if your animal is long and thin, you have already reduced the importance of the former measurement by scaling.

Jim
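Jim's point about scaling can be seen on a small made-up example. The five animals and the column names below are hypothetical; the sketch only shows that scale() puts every column on the same footing, so a measurement with a large raw spread no longer dominates.

```r
# Hypothetical measurements for five animals, in the thread's mixed units.
toy <- data.frame(
  length_mm = c(3.9, 3.7, 3.8, 4.1, 3.5),
  weight_mg = c(98, 62, 60, 110, 54)
)

sapply(toy, sd)          # raw spreads differ by orders of magnitude

z <- scale(toy)          # subtract column means, divide by column SDs
colMeans(z)              # approximately 0 for every column
apply(z, 2, sd)          # 1 for every column
```

After scaling, a value of 1 means "one standard deviation above the sample mean" for any column, which is what makes the scores comparable within the population.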
David L Carlson
2016-Nov-14 16:06 UTC
[R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics
The first principal component should be your estimate of "size" since it captures the correlations between all 4 variables. The second principal component must be orthogonal to the first so that if the first is "size", the second pc is independent of size, perhaps some measure of "shape". As would be expected, the first principal component is highly correlated with the geometric mean of the three linear measurements and moderately correlated with weight:

> gm <- apply(df[, -1], 1, prod)^(1/3)
> pc1 <- prcomp(df, scale.=TRUE)$x[, 1]
> plot(pc1, gm)
> cor(cbind(pc1, gm, wgt=df$weight))
           pc1         gm        wgt
pc1  1.0000000 -0.9716317 -0.5943594
gm  -0.9716317  1.0000000  0.3967369
wgt -0.5943594  0.3967369  1.0000000

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
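The negative correlations in David's output reflect the fact that the sign of a principal component is arbitrary. A minimal sketch of the same comparison on simulated data follows; `toy` and its columns are invented for illustration, and the sign flip is a common convention rather than anything proposed in the thread.

```r
# Made-up data: one log-weight column plus three positive lengths that
# share a common "size" signal. None of this is the study data.
set.seed(1)
n <- 40
latent <- rnorm(n)
toy <- data.frame(
  weight = 4 + 0.3 * latent + rnorm(n, sd = 0.2),
  len1   = exp(0.10 * latent + rnorm(n, sd = 0.03)),
  len2   = exp(0.08 * latent + rnorm(n, sd = 0.03)),
  len3   = exp(0.12 * latent + rnorm(n, sd = 0.03))
)

# Geometric mean of the three lengths, computed two equivalent ways.
gm_prod <- apply(toy[, -1], 1, prod)^(1/3)
gm_log  <- exp(rowMeans(log(toy[, -1])))

# PC1 scores from a correlation-matrix PCA.
pc1 <- prcomp(toy, scale. = TRUE)$x[, 1]

# The sign of a principal component is arbitrary, so orient PC1 to
# increase with the geometric mean before reading it as "size".
if (cor(pc1, gm_prod) < 0) pc1 <- -pc1

cor(pc1, gm_prod)
```

The log form of the geometric mean is often preferred for longer products because it avoids overflow or underflow.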
Sidoti, Salvatore A.
2016-Nov-14 17:41 UTC
[R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics
Fascinating! So it appears that I can simply take the geometric mean of all 4
metrics (unscaled), including weight, then designate that value as a relative
measure of "size" within my sample population. The justification for
using the geometric mean is shown by the high correlation between PC1 and the
size values:
           pc1         gm
pc1  1.0000000 -0.8458024
gm  -0.8458024  1.0000000

	Pearson's product-moment correlation

data:  pc1 and gm
t = -10.869, df = 47, p-value = 2.032e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9104585 -0.7407939
sample estimates:
       cor
-0.8458024

Salvatore A. Sidoti
PhD Student
Behavioral Ecology
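
The ranking described in this message could be sketched as below. The data are simulated and the column names merely echo the thread; the four-variable geometric mean, the rank direction (1 = largest) and the use of cor.test() with its default Pearson method are assumptions made for illustration.

```r
# Simulated, strictly positive measurements; not the study data.
set.seed(7)
n <- 25
latent <- rnorm(n)
toy <- data.frame(
  weight  = exp(4.2 + 0.30 * latent + rnorm(n, sd = 0.10)),
  interoc = exp(-0.2 + 0.05 * latent + rnorm(n, sd = 0.03)),
  cwidth  = exp(1.1 + 0.06 * latent + rnorm(n, sd = 0.03)),
  clength = exp(1.3 + 0.06 * latent + rnorm(n, sd = 0.03))
)

# Geometric mean of all four metrics for each animal.
gm4 <- exp(rowMeans(log(toy)))

# Rank animals by that score; rank 1 is the largest animal.
size_rank <- rank(-gm4, ties.method = "first")

# Compare the score with PC1 from a correlation-matrix PCA.
pc1 <- prcomp(toy, scale. = TRUE)$x[, 1]
ct <- cor.test(pc1, gm4)
ct$estimate    # sign depends on the arbitrary orientation of PC1
ct$parameter   # degrees of freedom, n - 2
```

Here the degrees of freedom are n - 2 = 23, in the same way that the output above reports df = 47 for the 49 animals in the original data set.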