Allan Strand
2001-Oct-06 17:45 UTC
[R] calculating DNA mismatch distributions for large populations
Hi all, I am interested in calculating and displaying the distributions of pairwise comparisons between DNA sequences in populations. The comparisons are the number of nucleotide sites that differ between the two sequences (mismatches). My sequences are stored in a vector of strings. There is an additional vector of the same length that provides the indices to the DNA sequences. Finally, I have output from table() that gives the distribution of indices in the population. It looks something like this: seqindex 1,2,3,4,5 sequences "ATG","ATC","AAT","CTT","GTC" table() output of seqindex frequencies: 1 2 3 4 5 6 10 5 1 7 The sequences are usually around 100 characters long (rather than 3) and there can be several hundred different unique sequences and the freqencies of sequence indices may total many thousands (often 30,000). Right now, I am trying the following approach: 1) I create a 'distance matrix' of mismatches. I have been calculating the mismatches using the following approach (states() creates a list of two vectors: sl$aindex==indices and sl$state==sequences) sl<-states(lnum,Rland); rmat<-matrix(0,nrow=length(sl[[1]]),ncol=length(sl[[1]])); for (i in 1:length(sl[[1]])) for (j in i:length(sl[[1]])) { if (i!=j) { vi<-strsplit(sl$state[[i]],NULL)[[1]] vj<-strsplit(sl$state[[j]],NULL)[[1]] rmat[j,i]<-length(vi)-sum(vi==vj); rmat[j,i]->rmat[i,j]; } } rmat this works pretty well, although it is not blindingly fast. 2) I have then tried this approach to get the distribution of mismatches for the entire population: I have tried to concatenate X copies of rows of rmat corresponding to a particular sequence index. I'm using "rep(rmat[seqindex,],X)". Here, X is the the frequency of that sequence index (and sequence) in the population. I try then to repeat this for every sequence index present in the pop. My plan was to then run table() on the resulting vector. As I suspected, trying this with real data brought my 450Mhz/128Meg RAM box to its knees and I never had the patience to wait for the finished product. This effect is due to large amounts of memory usage. My question is: What is a more efficient (in terms of memory and/or speed) way to solve this problem. Clearly, this partially brain-dead biologist has not found it. I think this question really comes down to what is the efficient way to solve part 2. One possibility that occurred to me is to use table() on each row of rmat, then multiply the resulting table by the number of times that the sequence is found in the population. I'm not clear on how to combine the resulting tables for all sequences however. Nor am I clear on whether this is an improvement on my first approach. Certainly, it would seem that this could improve memory usage. Thanks for any suggestions, a. BTW: platform i386-unknown-freebsd4.4 arch i386 os freebsd4.4 system i386, freebsd4.4 status major 1 minor 3.1 year 2001 month 08 day 31 language R -- Allan Strand, Biology http://linum.cofc.edu College of Charleston Ph. (843) 953-8085 Charleston, SC 29424 Fax (843) 953-5453 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._