Sherri Rose
2010-Feb-28 00:07 UTC
[R] Combining 2 columns into 1 column many times in a very large dataset
The clumsy solutions I am working on are not going to be very fast, if I can get them to work at all, and the true dataset is ~1500 x 45000, so they need to be efficient. I've searched the R help files and the archives for this list and have some possible workable solutions for 2) and 3) but not for my question 1). However, I include 2) and 3) in case anyone has recommendations that would be efficient.

Here is a toy example of the data structure:

    n = 10  # length of the SNP vectors below
    pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                     age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
                     rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n),
                     rs157=c(2,4,2,2,2,4,4,4,2,2),
                     rs157.1=c(4,4,4,2,4,4,4,4,2,2), rs132=c(4,4,4,4,4,4,4,4,2,2),
                     rs132.1=c(4,4,4,4,4,4,4,4,4,4))

Thus, there are a few columns of basic demographic info, and the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.

1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example, rs123 and rs123.1 into one column (but within the dataset):

    11
    31
    11
    31
    31
    11
    11
    11
    31
    11

2) I need to identify the least frequent SNP value (in the above example it is 31).

3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.

Thank you for any assistance,
-S.R.
Phil Spector
2010-Feb-28 02:13 UTC
[R] Combining 2 columns into 1 column many times in a very large dataset
Sherri -
   Here's one way:

    nms = c('rs123','rs157','rs132')
    lowf = function(one,two){
       # 1) paste the two allele columns into one genotype string
       both = paste(pop[[one]],pop[[two]],sep='')
       # 2) find the least frequent genotype
       tt = table(both)
       lowfreq = names(tt)[which.min(tt)]
       # 3) code it as 1 and everything else as 0
       ifelse(both == lowfreq,1,0)
    }
    res = mapply(lowf,nms,paste(nms,'.1',sep=''),SIMPLIFY=FALSE)
    names(res) = paste(names(res),'_new',sep='')
    pop = data.frame(pop,res)

It doesn't deal with the case of ties with regard to the frequency of the SNP value, but it should be easy to modify if that's an issue.

                                        - Phil Spector
                                          Statistical Computing Facility
                                          Department of Statistics
                                          UC Berkeley
                                          spector at stat.berkeley.edu
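One possible way to handle the ties mentioned above, and to avoid typing out the SNP names for a wide dataset, is sketched here. It is only an illustration under stated assumptions: every second-allele column is assumed to be named "<snp>.1" with a matching "<snp>" column, no other column name ends in ".1", and the helper name lowf_ties is hypothetical. When several genotypes share the minimum count, all of them are coded 1, which may or may not be the rule you want.

```r
# Toy data from the original post; n = 10 matches the SNP vectors.
set.seed(1)  # assumption: fixed seed so the demographic columns are reproducible
n = 10
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                 age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
                 rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n),
                 rs157=c(2,4,2,2,2,4,4,4,2,2), rs157.1=c(4,4,4,2,4,4,4,4,2,2),
                 rs132=c(4,4,4,4,4,4,4,4,2,2), rs132.1=c(4,4,4,4,4,4,4,4,4,4))

# Assumption: second-allele columns end in ".1"; strip the suffix to get
# the name of the matching first-allele column.
second = grep('\\.1$', names(pop), value=TRUE)
first = sub('\\.1$', '', second)
stopifnot(all(first %in% names(pop)))

# Hypothetical tie-aware variant of lowf: every genotype that shares the
# minimum count is coded 1, all others 0. The data frame is passed in
# explicitly instead of being read from the global environment.
lowf_ties = function(one, two, data){
   both = paste(data[[one]], data[[two]], sep='')
   tt = table(both)
   lowfreq = names(tt)[tt == min(tt)]
   as.integer(both %in% lowfreq)
}

res = mapply(lowf_ties, first, second, MoreArgs=list(data=pop), SIMPLIFY=FALSE)
names(res) = paste(first, '_new', sep='')
pop = data.frame(pop, res)
```

In the toy data, rs157 has a tie (genotypes 22 and 24 each occur three times), so both are coded 1 here, whereas which.min would pick only the first. Note that a SNP with a single observed genotype ends up coded 1 in every row under this rule, so such columns may need separate handling.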