Cheyenne Yancey Payne
2018-Mar-15 00:11 UTC
[R] stats 'dist' euclidean distance calculation
Hello, I am working with a matrix of multilocus genotypes for ~180 individual snail samples, with substantial missing data. I am trying to calculate the pairwise genetic distance between individuals using the stats package 'dist' function, using euclidean distance. I took a subset of this dataset (3 samples x 3 loci) to test how euclidean distance is calculated: 3x3 subset used Locus1 Locus2 Locus3 Samp1 GG <NA> GG Samp2 AG CA GA Samp3 AG CA GG The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)) My assumption was that the difference between x_i and y_i would be the number of allelic differences at each base pair site between samples. For example, the euclidean distance between Samp1 and Samp2 would be: euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 - S2_L3)^2 ) at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1 at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4 at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1 euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940 Calculating euclidean distances this way, the distance matrix should be: # Samp1 Samp2 Samp3 # Samp1 0.000000 2.449400 2.236068 # Samp2 2.449400 0.000000 1.000000 # Samp3 2.236068 1.000000 0.000000 However, this distance matrix differs from the one calculated by the R stats package 'dist' function: # Samp1 Samp2 Samp3 # Samp1 0.000000 3.478652 2.659285 # Samp2 3.478652 0.000000 2.124787 # Samp3 2.659285 2.124787 0.000000 I used the following code (with intermediate output) to achieve the latter distance matrix:>>>setwd("~/Desktop/R_stuff") ### Data Prep: load and collect info from matrix file infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt' Mydata <- read.table(infile, header = TRUE, check.names = FALSE) dim(Mydata) # dimensions of data.frame ind <- as.character(Mydata$sample) # individual labels population <- as.character(Mydata$region) # population labels location <- Mydata$location ### Section 1: Convert data to usable format # removes non-genotype data from matrix (i.e. lines 1-4) # subset 3 samples, 3 loci for testing SAMPS<-3 locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))] locus # Locus1 Locus2 Locus3 # Samp1 GG <NA> GG # Samp2 AG CA GA # Samp3 AG CA GG # converts geno matrix to genind object (adegenet) Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop = population[1:SAMPS], sep="") Mydata1$tab # get stats on newly created genind object # Locus1.G Locus1.A Locus2.C Locus2.A Locus3.G Locus3.A # Samp1 2 0 NA NA 2 0 # Samp2 1 1 1 1 1 1 # Samp3 1 1 1 1 2 0 ### Section 2: Individual genetic distance: euclidean distance (dist {adegenet}) # Test dist() function # collect euclidean genetic distances distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper = FALSE, p = 2) distgenEUCL # Samp1 Samp2 # Samp2 2.449490 # Samp3 1.732051 1.414214 x<-as.matrix(dist(distgenEUCL)) x # Samp1 Samp2 Samp3 # Samp1 0.000000 3.478652 2.659285 # Samp2 3.478652 0.000000 2.124787 # Samp3 2.659285 2.124787 0.000000>>>I have checked several forums and have been unable to resolve this discrepancy. Any and all help will be much appreciated! Thank you! Cheyenne Cheyenne Payne Bioinformatics Technician Palumbi Lab | Hopkins Marine Station (714) 200-5897 [[alternative HTML version deleted]]
Hi Cheyenne, I noticed one thing that might be helpful to you. First, I took a shortcut to the case of interest:> m <- matrix(c(2,1,1,0,1,1,NA,1,1,NA,1,1,2,1,2,0,1,0),nrow=3) > colnames(m) <- c("1.G","1.A","2.C","2.A","3.G","3.A") > m# 1.G 1.A 2.C 2.A 3.G 3.A # [1,] 2 0 NA NA 2 0 # [2,] 1 1 1 1 1 1 # [3,] 1 1 1 1 2 0 Computing the distance between the different rows by hand - TREATING THE NA's AS ZERO - would give: dist(row1,row2) = sqrt( 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(6) = 2.45 dist(row1,row3) = sqrt( 1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2) = sqrt(4) = 2 dist(row2,row3) = sqrt( 0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 1^2) = sqrt(2) 1.414 Doing the same calculation with the dist() function gives> dist(m)# 1 2 # 2 2.45 # 3 1.73 1.414 i.e. the results match with the manual calculation for dist(row1,row2) and dist(row2,row3). However for dist(row1,row3) which should be 2, the dist function gives 1.73 = sqrt(3). Clearly sqrt(3) makes no sense since |1 - NA|^2 appears twice. Either both times it should get a value of 1 or neither time. Why only once? Not clear to me and I did not see any hints on ?dist. However if you replace the NA's by actual 0's, which seems to be your preferred methodology, then the problem is "solved", i.e.> m2 <- m > m2[1,][is.na(m2[1,])] <- 0 > dist(m2)# 1 2 # 2 2.45 # 3 2 1.414 HTH, Eric On Thu, Mar 15, 2018 at 2:11 AM, Cheyenne Yancey Payne <cypayne at stanford.edu> wrote:> Hello, > > I am working with a matrix of multilocus genotypes for ~180 individual > snail samples, with substantial missing data. I am trying to calculate the > pairwise genetic distance between individuals using the stats package > 'dist' function, using euclidean distance. I took a subset of this dataset > (3 samples x 3 loci) to test how euclidean distance is calculated: > > 3x3 subset used > Locus1 Locus2 Locus3 > Samp1 GG <NA> GG > Samp2 AG CA GA > Samp3 AG CA GG > > The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)) > My assumption was that the difference between x_i and y_i would be the > number of allelic differences at each base pair site between samples. For > example, the euclidean distance between Samp1 and Samp2 would be: > > euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 - > S2_L3)^2 ) > at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1 > at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4 > at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1 > > euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940 > > Calculating euclidean distances this way, the distance matrix should be: > # Samp1 Samp2 Samp3 > # Samp1 0.000000 2.449400 2.236068 > # Samp2 2.449400 0.000000 1.000000 > # Samp3 2.236068 1.000000 0.000000 > > However, this distance matrix differs from the one calculated by the R > stats package 'dist' function: > # Samp1 Samp2 Samp3 > # Samp1 0.000000 3.478652 2.659285 > # Samp2 3.478652 0.000000 2.124787 > # Samp3 2.659285 2.124787 0.000000 > > I used the following code (with intermediate output) to achieve the latter > distance matrix: > > > >>> > setwd("~/Desktop/R_stuff") > > ### Data Prep: load and collect info from matrix file > infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt' > Mydata <- read.table(infile, header = TRUE, check.names = FALSE) > dim(Mydata) # dimensions of data.frame > ind <- as.character(Mydata$sample) # individual labels > population <- as.character(Mydata$region) # population labels > location <- Mydata$location > > ### Section 1: Convert data to usable format > # removes non-genotype data from matrix (i.e. lines 1-4) > # subset 3 samples, 3 loci for testing > SAMPS<-3 > locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))] > locus > # Locus1 Locus2 Locus3 > # Samp1 GG <NA> GG > # Samp2 AG CA GA > # Samp3 AG CA GG > > # converts geno matrix to genind object (adegenet) > Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop > population[1:SAMPS], sep="") > Mydata1$tab # get stats on newly created genind object > # Locus1.G Locus1.A Locus2.C Locus2.A > Locus3.G Locus3.A > # Samp1 2 0 NA > NA 2 0 > # Samp2 1 1 1 > 1 1 1 > # Samp3 1 1 1 > 1 2 0 > > ### Section 2: Individual genetic distance: euclidean distance (dist > {adegenet}) > > # Test dist() function > # collect euclidean genetic distances > distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper > FALSE, p = 2) > distgenEUCL > # Samp1 Samp2 > # Samp2 2.449490 > # Samp3 1.732051 1.414214 > > x<-as.matrix(dist(distgenEUCL)) > x > # Samp1 Samp2 Samp3 > # Samp1 0.000000 3.478652 2.659285 > # Samp2 3.478652 0.000000 2.124787 > # Samp3 2.659285 2.124787 0.000000 > >>> > > > > I have checked several forums and have been unable to resolve this > discrepancy. Any and all help will be much appreciated! > > > Thank you! > > Cheyenne > > > Cheyenne Payne > > Bioinformatics Technician > > Palumbi Lab | Hopkins Marine Station > > (714) 200-5897 > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
> 3x3 subset used > Locus1 Locus2 Locus3 > Samp1 GG <NA> GG > Samp2 AG CA GA > Samp3 AG CA GG > > The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)) My > assumption was that the difference between x_i and y_i would be the number > of allelic differences at each base pair site between samples.Base R does not share your assumption, which (from a general purpose stats point of view) would be a completely outlandish interpretation of the data. As far as base R is concerned, these are just arbitrary character strings represented (by default) as factors. Since factors are, internally, integers assigned (by default) in increasing lexical order to the levels present, if you apply dist() to factors constructed from allele data, you will usually get complete nonsense in genetic terms. You should probably look at something like dist.gene in the ape package: see https://www.rdocumentation.org/packages/ape/versions/5.0/topics/dist.gene S Ellison ******************************************************************* This email and any attachments are confidential. Any use...{{dropped:8}}
.... and I believe this whole thread may fit better at the Bioconductor list rather than here. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Thu, Mar 15, 2018 at 5:11 AM, S Ellison <S.Ellison at lgcgroup.com> wrote:> > 3x3 subset used > > Locus1 Locus2 Locus3 > > Samp1 GG <NA> GG > > Samp2 AG CA GA > > Samp3 AG CA GG > > > > The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)) > My > > assumption was that the difference between x_i and y_i would be the > number > > of allelic differences at each base pair site between samples. > > Base R does not share your assumption, which (from a general purpose stats > point of view) would be a completely outlandish interpretation of the data. > As far as base R is concerned, these are just arbitrary character strings > represented (by default) as factors. Since factors are, internally, > integers assigned (by default) in increasing lexical order to the levels > present, if you apply dist() to factors constructed from allele data, you > will usually get complete nonsense in genetic terms. > > You should probably look at something like dist.gene in the ape package: > see > https://www.rdocumentation.org/packages/ape/versions/5.0/topics/dist.gene > > S Ellison > > > ******************************************************************* > This email and any attachments are confidential. Any u...{{dropped:13}}